We address the issue of variable selection in the regression model with
very high ambient dimension, i.e., when the number of variables is very 
large. The main focus is on the situation where the number of relevant vari- 
ables, called intrinsic dimension and denoted by d*, is much smaller than 
the ambient dimension d. Without assuming any parametric form of the 
underlying regression function, we get tight conditions making it possible 
to consistently estimate the set of relevant variables. These conditions re- 
late the intrinsic dimension to the ambient dimension and to the sample 
size. The procedure that is provably consistent under these tight condi- 
tions is based on comparing quadratic functionals of the empirical Fourier 
coefficients with appropriately chosen threshold values. 

The asymptotic analysis reveals the presence of two quite different
regimes. The first regime is when d* is fixed. In this case the situation in
nonparametric regression is the same as in linear regression, i.e., consistent
variable selection is possible if and only if log d is small compared to the
sample size n. The picture is different in the second regime, $d^* \to \infty$ as
$n \to \infty$, where we prove that consistent variable selection in the nonparametric
set-up is possible only if $d^* + \log\log d$ is small compared to log n. We
apply these results to derive minimax separation rates for the problem of 
variable selection. 



1. Introduction. Real-world data such as those obtained from neuroscience,
chemometrics, data mining, or sensor-rich environments are often extremely 
high-dimensional, severely underconstrained (few data samples compared to 
the dimensionality of the data), and interspersed with a large number of irrelevant
or redundant features. Furthermore, in most situations the data are contaminated
by noise, making it even more difficult to retrieve useful information
from the data. Relevant variable selection is a compelling approach for address- 
ing statistical issues in the scenario of high-dimensional and noisy data with 
small sample size. Starting from Mallows [29], Akaike [1] and Schwarz [36], who
introduced respectively the famous criteria Cp, AIC and BIC, the problem of variable
selection was extensively studied in the statistical and machine learning
literature both from the theoretical and algorithmic viewpoints. It appears, how- 
ever, that the theoretical limits of performing variable selection in the context of 
nonparametric regression are still poorly understood, especially when the num- 
ber of variables, denoted by d and referred to as ambient dimension, is much 
larger than the sample size n. The purpose of the present work is to explore this 
setting under the assumption that the number of relevant variables, hereafter 
called intrinsic dimension and denoted by d*, may grow with the sample size 
but remains much smaller than d. 

In the important particular case of linear regression, the latter scenario was 
the subject of a number of recent studies. Many of them rely on $\ell_1$-norm penalization
[38, 46, 31] and constitute an attractive alternative to iterative variable
selection procedures [2, 45] and to marginal regression or correlation screening 
[42, 18]. Promising results for feature selection are also obtained by conformal 
prediction [20], (minimax) concave penalties [16, 17, 44], Bayesian approach [37] 
and higher criticism [15]. Extensions to other settings including logistic regres- 
sion, generalized linear model and Ising model were carried out in [8, 34, 18], 
respectively. Variable selection in the context of groups of variables with disjoint 
or overlapping groups was studied by [43, 24, 28, 32, 21]. Hierarchical procedures
for selection of relevant variables were proposed by [3, 5, 47]. 

It is now well understood that in the Gaussian sequence model and in the 
high-dimensional linear regression with a Gram matrix satisfying some variant
of the irrepresentable condition, consistent estimation of the pattern of relevant
variables (also called the sparsity pattern) is possible under the condition
$d^*\log(d/d^*) = o(n)$ as $n \to \infty$ [41]. Furthermore, it is well known that if
$(d^*\log(d/d^*))/n$ remains bounded from below by some positive constant when
$n \to \infty$, then it is impossible to consistently recover the sparsity pattern [40].
Thus, a tight condition exists that describes in an exhaustive manner the inter- 
play between the quantities d*, d and n that guarantees the existence of consis- 
tent estimators. The situation is very different in the case of non-linear regres- 
sion, since, to our knowledge, there is no result providing tight conditions for 
consistent estimation of the sparsity pattern. 

The papers [26] and [4], closely related to the present work, considered the
problem of variable selection in the nonparametric Gaussian regression model. They
proved the consistency of the proposed procedures under some assumptions 
that — in the light of the present work — turn out to be suboptimal. More pre- 
cisely, Lafferty and Wasserman [26] assumed the unknown regression function 
to be four times continuously differentiable with bounded derivatives. The algorithm
they proposed, termed Rodeo, is a greedy procedure performing simultaneously
local bandwidth choice and variable selection. Rodeo is shown to converge
when the ambient dimension d is $O(\log n/\log\log n)$ while the intrinsic dimension
d* does not increase with n. On the other hand, Bertin and Lecué [4]
proposed a procedure based on the $\ell_1$-penalization of local polynomial estimators
and proved its consistency when $d^* = O(1)$ but d is allowed to be as large as
log n, up to a constant. They also make a weaker assumption on the regression
function, which is merely assumed to belong to the Hölder class with smoothness $\beta > 1$.
To complete the picture, let us mention that estimation and hypotheses testing 
problems for high-dimensional nonparametric regression under sparse additive 
modeling were recently addressed in [25, 33, 19]. 

This brief review of the literature reveals that there is an important gap in 
consistency conditions for the linear regression and for the non-linear one. For 
instance, if the intrinsic dimension d* is fixed, then the condition guaranteeing 
consistent estimation of the sparsity pattern is $(\log d)/n \to 0$ in linear regression
whereas it is $d = O(\log n)$ in the nonparametric case. While it is undeniable that
nonparametric regression is much more complex than the linear one, it is
however not easy to find a justification for such an important gap between the two
conditions. The situation is even worse in the case where $d^* \to \infty$. In fact, for
the linear model with at most polynomially increasing ambient dimension $d = O(n^C)$,
it is possible to estimate the sparsity pattern for intrinsic dimensions d*
as large as $n^{1-\epsilon}$, for some $\epsilon > 0$. In other words, the sparsity index can be almost
on the same order as the sample size. In contrast, in nonparametric regression, 
there is no procedure that is proved to converge to the true sparsity pattern when 
both n and d* tend to infinity, even if d* grows extremely slowly. 

In the present work, we fill this gap by introducing a simple variable selection 
procedure that selects the relevant variables by comparing some quadratic func- 
tionals of empirical Fourier coefficients to prescribed significance levels. Con- 
sistency of this procedure is established under some conditions on the triplet 
(d*, d, n) and the tightness of these conditions is proved. The main take-away
messages deduced from our results are the following: 

• When the number of relevant variables d* is fixed and the sample size n
tends to infinity, there exist positive real numbers $c_*$ and $c^*$ such that (a) if
$(\log d)/n \le c_*$ the estimator proposed in Section 3 is consistent and (b) no
estimator of the sparsity pattern may be consistent if $(\log d)/n \ge c^*$.

• When the number of relevant variables d* tends to infinity with $n \to \infty$,
then there exist real numbers $\underline{c}_i$ and $\bar{c}_i$, $i = 1, 2$, such that $\underline{c}_1 > 0$, $\bar{c}_1 > 0$ and
(a) if $\underline{c}_1 d^* + \log\log(d/d^*) - \log n < \underline{c}_2$ the estimator proposed in Section 3 is
consistent and (b) no estimator of the sparsity pattern may be consistent
if $\bar{c}_1 d^* + \log\log(d/d^*) - \log n > \bar{c}_2$.

• In particular, if d grows not faster than a polynomial in n, then there exist
positive real numbers $c_0$ and $c^0$ such that (a) if $d^* \le c_0\log n$ the estimator
proposed in Section 3 is consistent and (b) no estimator of the sparsity
pattern may be consistent if $d^* \ge c^0\log n$.

In the regime of a growing intrinsic dimension $d^* \to \infty$ and a moderately large
ambient dimension $d = O(n^C)$, for some $C > 0$, we make a concentrated effort
to get the constant $c_0$ as close as possible to the constant $c^0$. This goal is reached
for the model of Gaussian white noise and, very surprisingly, it required us
to apply some tools from complex analysis, such as the Jacobi $\theta$-function and
the saddle point method, in order to evaluate the number of lattice points lying 
in a ball of an Euclidean space with increasing dimension. 

The rest of the paper is organized as follows. The notation and assumptions 
necessary for stating our main results are presented in Section 2. In Section 3, an 
estimator of the set of relevant variables is introduced and its consistency is es- 
tablished, in the case where the data come from the Gaussian white noise model. 
The main condition required in the consistency result involves the number of 
lattice points in a ball of a high-dimensional Euclidean space. An asymptotic 
equivalent for this number is presented in Section 4. Results on impossibility 
of consistent estimation of the sparsity pattern are derived in Section 5. Sec- 
tion 6 is devoted to exploring adaptation to the unknown parameters (smooth- 
ness and degree of significance) and recovering minimax rates of separation. 
Then, in Section 7, we show that some of our results can be extended to the 
model of nonparametric regression. The relations between consistency and in- 
consistency results are discussed in Section 8. The technical parts of the proofs 
are postponed to the Appendix. 

2. The problem formulation and the assumptions. We are interested in 
the variable selection task (also known as model selection, feature selection, 
sparsity pattern estimation) in the context of high-dimensional non-linear regression.
Let $f : [0,1]^d \to \mathbb{R}$ denote the unknown regression function. We assume
that the number of variables d is very large, possibly much larger than the sam- 
ple size n, but only a small number of these variables contribute to the fluctua- 
tions of the regression function f . 

To be more precise, we assume that for some small subset J of the index set
$\{1,\dots,d\}$ satisfying $\mathrm{Card}(J) \le d^*$, there is a function $\bar f : \mathbb{R}^{\mathrm{Card}(J)} \to \mathbb{R}$ such that

$$f(x) = \bar f(x_J), \qquad \forall x \in \mathbb{R}^d,$$

where $x_J$ stands for the subvector of x obtained by removing from x all the coordinates
with indices lying outside J. In what follows, we allow d and d* to
depend on n but we will not always indicate this dependence in notation. Note
also that the genuine intrinsic dimension is Card(J); d* is merely a known upper
bound on the intrinsic dimension. In what follows, we use the standard notation

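for the vector and sequence norms:

$$\|x\|_0 = \sum_j \mathbf{1}(x_j \ne 0), \qquad \|x\|_p^p = \sum_j |x_j|^p, \ \ \forall p \in [1,\infty), \qquad \|x\|_\infty = \sup_j |x_j|,$$

for every $x \in \mathbb{R}^d$ or $x \in \mathbb{R}^{\mathbb{N}}$.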

Let us stress right away that the primary aim of this work is to understand 
when it is possible to estimate the sparsity pattern J (with theoretical guarantees
on the convergence of the estimator) and when it is impossible. The estimator
that we will define in the next sections is intended to show the possibility of consistent
estimation, rather than to provide a practical procedure for recovering the
sparsity pattern. Therefore, the estimator will be allowed to depend on different 
constants appearing in conditions imposed on the regression function f and on 
some characteristics of the noise. 

To make the consistent estimation of the set J realizable, we impose some
smoothness and identifiability assumptions on f. In order to describe the smoothness
assumption imposed on f, let us introduce the trigonometric Fourier basis:
$\varphi_0 \equiv 1$ and

$$\varphi_k(x) = \begin{cases} \sqrt{2}\cos(2\pi\, k\cdot x), & k \in (\mathbb{Z}^d)_+, \\ \sqrt{2}\sin(2\pi\, k\cdot x), & -k \in (\mathbb{Z}^d)_+, \end{cases} \qquad (1)$$

where $(\mathbb{Z}^d)_+$ denotes the set of all $k \in \mathbb{Z}^d \setminus \{0\}$ such that the first nonzero element
of k is positive and $k\cdot x$ stands for the usual inner product in $\mathbb{R}^d$. In what follows,
we use the notation $\langle\cdot,\cdot\rangle$ to denote the scalar product in $L^2([0,1]^d;\mathbb{R})$, that is
$\langle h, \tilde h\rangle = \int_{[0,1]^d} h(x)\tilde h(x)\,dx$ for every $h, \tilde h \in L^2([0,1]^d;\mathbb{R})$. Using this orthonormal
Fourier basis, we define

$$\Sigma_L = \Big\{ f : \sum_{k\in\mathbb{Z}^d} k_j^2\,\langle f, \varphi_k\rangle^2 \le L, \ \ \forall j \in \{1,\dots,d\} \Big\}.$$

To ease notation, we set $\theta_k[f] = \langle f, \varphi_k\rangle$ for all $k \in \mathbb{Z}^d$. In addition to the smoothness,
we also need to require that the relevant variables are sufficiently relevant
for making their identification possible. This is done by means of the following
condition.

[C1(κ, L)] The regression function f belongs to $\Sigma_L$. Furthermore, for some subset
$J \subset \{1,\dots,d\}$ of cardinality $\le d^*$, there exists a function $\bar f : \mathbb{R}^{\mathrm{Card}(J)} \to \mathbb{R}$
such that $f(x) = \bar f(x_J)$, $\forall x \in \mathbb{R}^d$, and it holds that

$$Q_j[f] = \sum_{k : k_j \ne 0} \theta_k[f]^2 \ge \kappa, \qquad \forall j \in J. \qquad (2)$$

One easily checks that $Q_j[f] = 0$ for every j that does not lie in the sparsity pattern.
This provides a characterization of the sparsity pattern as the set of indices
of nonzero coefficients of the vector $Q[f] = (Q_1[f],\dots,Q_d[f])$.

Prior to describing the procedures for estimating J, let us comment on Condition
[C1]. It is important to note that the identifiability assumption (2) can
be rewritten as $\int_{[0,1]^d} \big(f(x) - \int_0^1 f(x)\,dx_j\big)^2 dx \ge \kappa$ and, therefore, is not intrinsically
related to the basis we have chosen. In the case of a continuously differentiable
and 1-periodic function f, the smoothness assumption $f \in \Sigma_L$ can also
be rewritten without using the trigonometric basis, since $\sum_{k\in\mathbb{Z}^d} k_j^2\,\theta_k[f]^2 =
(2\pi)^{-2}\int_{[0,1]^d} [\partial_j f(x)]^2\,dx$. Thus, condition [C1] is essentially a constraint on the
function f itself and not on its representation in the specific basis of trigonometric
functions.
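To make the basis (1) and the quantities in (2) concrete, the following minimal sketch evaluates Fourier coefficients of a toy function on the unit square by a midpoint rule and sums their squares over the frequencies with a nonzero j-th entry. The helper names, the toy function and the grid size are illustrative assumptions and do not come from the paper.

```python
import math
from itertools import product

def in_plus(k):
    """True if the first nonzero entry of the multi-index k is positive."""
    for kj in k:
        if kj != 0:
            return kj > 0
    return False

def phi(k, x):
    """Trigonometric basis function: constant, cosine or sine depending on k."""
    if not any(k):
        return 1.0
    arg = 2.0 * math.pi * sum(kj * xj for kj, xj in zip(k, x))
    return math.sqrt(2.0) * (math.cos(arg) if in_plus(k) else math.sin(arg))

def theta(f, k, d, grid=16):
    """Fourier coefficient <f, phi_k>, approximated by a midpoint rule on [0,1]^d."""
    pts = [(i + 0.5) / grid for i in range(grid)]
    return sum(f(x) * phi(k, x) for x in product(pts, repeat=d)) / grid ** d

def Q(f, j, d, kmax=2):
    """Sum of squared coefficients over frequencies k (entries in [-kmax, kmax]) with k_j != 0."""
    return sum(theta(f, k, d) ** 2
               for k in product(range(-kmax, kmax + 1), repeat=d) if k[j] != 0)

def f(x):
    # Toy function (assumption): depends on the first coordinate only.
    return 0.5 + math.sqrt(2.0) * math.cos(2.0 * math.pi * x[0])

Q1, Q2 = Q(f, 0, 2), Q(f, 1, 2)
print(round(Q1, 6), round(abs(Q2), 6))
```

In this toy case the first coordinate carries all the fluctuation of the function, so only the first of the two quantities is bounded away from zero, which is the characterization of the sparsity pattern used in the text.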

The results of this work can be extended with minor modifications to other 
types of smoothness conditions imposed on f, such as Hölder continuity or Besov
regularity. In these cases the trigonometric basis (1) should be replaced by a
basis adapted to the smoothness condition (spline, wavelet, etc.). Furthermore,
even in the case of Sobolev smoothness, one can replace the set $\Sigma_L$ corresponding
to smoothness order 1 by any Sobolev ellipsoid of smoothness $\beta > 0$, see
for instance [10] where the case $\beta = 2$ is explored. Roughly speaking, the role
of the smoothness assumption is to reduce the statistical model with infinite-dimensional
parameter f to a finite-dimensional model having good approximation
properties. Any value of smoothness order $\beta > 0$ leads to this reduction.
The value $\beta = 1$ is chosen for simplicity of exposition only.

3. Idealized setup: Gaussian white noise model. To convey the main ideas 
without taking care of some technical details, we start by focusing our atten- 
tion on the Gaussian white noise model, that was proved to be asymptotically 
equivalent to the model of regression [6, 35], as well as to other nonparametric 
models [7, 13]. Thus, we assume that the available data consists of the Gaussian process
$\{Y(\phi) : \phi \in L^2([0,1]^d;\mathbb{R})\}$ such that

$$\mathbf{E}_f[Y(\phi)] = \int_{[0,1]^d} f(x)\,\phi(x)\,dx, \qquad \mathbf{Cov}_f\big(Y(\phi), Y(\phi')\big) = \frac{1}{n}\int_{[0,1]^d} \phi(x)\,\phi'(x)\,dx.$$

It is well-known that these two properties uniquely characterize the probability
distribution of a Gaussian process. An alternative representation of Y is

$$dY(x) = f(x)\,dx + n^{-1/2}\,dW(x), \qquad x \in [0,1]^d,$$

where W(x) is a d-parameter Brownian sheet. Note that minimax estimation
and detection of the function f in this set-up (but without sparsity assumption)
was studied by [23].

3.1. Estimation of J by multiple hypotheses testing. We intend to tackle the
variable selection problem by multiple hypotheses testing; each hypothesis con- 
cerns a group of the Fourier coefficients of the observed signal and suggests that 
all the elements within the group are zero. The rationale behind this approach is 
the following simple observation: since the trigonometric basis is orthonormal 
and contains the constant function, 

$$j \notin J \ \Longrightarrow\ \theta_k[f] = \langle f, \varphi_k\rangle = 0, \quad \forall k \text{ s.t. } k_j \ne 0. \qquad (3)$$

This observation entails that if the intrinsic dimension |J| is small as compared
to d, then the sequence of Fourier coefficients is sparse. Furthermore, as ex- 
plained below, there is a sort of group sparsity with overlapping groups. 

For every $\ell \in \{1,\dots,d^*\}$, we denote by $P_\ell^d$ the set of all subsets I of $\{1,\dots,d\}$
having exactly $\ell$ elements: $P_\ell^d = \{I \subset \{1,\dots,d\} : \mathrm{Card}(I) = \ell\}$. For every multi-index
$k \in \mathbb{Z}^d$, we denote by supp(k) the set of indices corresponding to nonzero
entries of k. To define the blocks of coefficients $\theta_k$ that will be tested for significance,
we introduce the following notation: for every $I \subset \{1,\dots,d\}$ and for every
$j \in I$, we set

$$\theta_I^j[f] = \big(\theta_k[f] : j \in \mathrm{supp}(k) \subset I\big).$$

It follows from (3) that the characterization

$$j \notin J \iff \max_I \|\theta_I^j[f]\|_p = 0 \qquad (4)$$

holds true for every $p \in [0,+\infty]$. Furthermore, again in view of (3), the maximum
over I of the norms $\|\theta_I^j[f]\|_p$ is attained when I = J and is equal to the maximum
over all subsets I such that $\mathrm{Card}(I) \le d^*$. Summarizing these arguments, we can
formulate the problem of variable selection as a problem of testing d null hypotheses

$$H_{0j} : \ \|\theta_I^j[f]\|_p = 0 \quad \forall I \subset \{1,\dots,d\} \text{ such that } \mathrm{Card}(I) \le d^*. \qquad (5)$$

If the hypothesis $H_{0j}$ is rejected, then the jth covariate is declared as relevant.
Note that by virtue of assumption [C1], the alternatives can be written as

$$H_{1j} : \ \|\theta_I^j[f]\|_2^2 \ge \kappa \quad \text{for some } I \subset \{1,\dots,d\} \text{ such that } \mathrm{Card}(I) \le d^*. \qquad (6)$$

Our estimator is based on this characterization of the sparsity pattern. If we denote
by $y_k$ the observable random variable $Y(\varphi_k)$, we have

$$y_k = \theta_k[f] + n^{-1/2}\xi_k, \qquad \theta_k[f] = \langle f, \varphi_k\rangle, \qquad k \in \mathbb{Z}^d, \qquad (7)$$

where $\{\xi_k ; k \in \mathbb{Z}^d\}$ form a countable family of independent Gaussian random
variables with zero mean and variance equal to one. According to this property,
$y_k$ is a good estimate of $\theta_k[f]$: it is unbiased and has a mean squared error equal
to 1/n. Using the plug-in argument, this suggests estimating $\theta_I^j$ by $\hat\theta_I^j = (y_k :
j \in \mathrm{supp}(k) \subset I)$ and the norm of $\theta_I^j$ by the norm of $\hat\theta_I^j$. However, since this
amounts to estimating an infinite-dimensional vector, the error of estimation
will be infinitely large. To cope with this issue, we restrict the set of indices for
which $\theta_k$ is estimated by $y_k$ to a finite set, outside of which $\theta_k$ will be merely
estimated by 0. Such a restriction is justified by the fact that f is assumed to
be smooth: Fourier coefficients corresponding to very high frequencies are very
small.

Let us fix an integer m > 0, the cut-off level, and denote, for $j \in I \subset \{1,\dots,d\}$,

$$S_{m,I}^j = \big\{k \in \mathbb{Z}^d : \|k\|_2 \le m \text{ and } \{j\} \subset \mathrm{supp}(k) \subset I\big\}.$$

Since the alternatives $H_{1j}$ are concerned with the 2-norm, we build our test
statistic on an estimate of the norm $\|\theta_I^j[f]\|_2$. To this end, we introduce

$$\hat Q_{m,I}^j = \sum_{k \in S_{m,I}^j} \Big(y_k^2 - \frac{1}{n}\Big),$$

which is an unbiased estimator of $Q_{m,I}^j = \sum_{k\in S_{m,I}^j} \theta_k^2$. Note that when $m \to \infty$,
the quantity $Q_{m,I}^j$ approaches $\|\theta_I^j[f]\|_2^2$. It is clear that larger values of m lead to
a smaller bias while the variance increases. Moreover, the variance of $\hat Q_{m,I}^j$
is proportional to the cardinality of the set $S_{m,I}^j$. The latter is an increasing function
of Card(I). Therefore, if we aim at getting comparable estimation accuracies
when estimating the functionals $\|\theta_I^j[f]\|_2^2$ by $\hat Q_{m,I}^j$ for various I's, it is reasonable
to make the cut-off level m vary with the cardinality of I.
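The unbiasedness of the debiased sum of squares can be checked by simulation. The sketch below builds the truncated index set for a two-element subset I (only the coordinates inside I are stored, since the others are zero), draws observations from the sequence model (7), and averages the statistic over repetitions. The signal, the sample size, the seed and all function names are hypothetical choices made for illustration.

```python
import math
import random
from itertools import product

def index_set(m, size, pos=0):
    """Multi-indices restricted to the coordinates of I (Card(I) = size):
    Euclidean norm at most m and a nonzero entry at position pos (the variable j)."""
    r = int(math.floor(m))
    return [k for k in product(range(-r, r + 1), repeat=size)
            if k[pos] != 0 and sum(c * c for c in k) <= m * m]

def q_hat(y, n, S):
    """Debiased sum of squares: sum over S of (y_k^2 - 1/n)."""
    return sum(y[k] ** 2 - 1.0 / n for k in S)

def simulate(theta, n, S, rng):
    """Draw y_k = theta_k + xi_k / sqrt(n) for k in S, with xi_k standard Gaussian."""
    return {k: theta.get(k, 0.0) + rng.gauss(0.0, 1.0) / math.sqrt(n) for k in S}

rng = random.Random(0)
n, m = 100, 2
S = index_set(m, size=2)        # I has two elements, j is its first element
theta = {(1, 0): 1.0}           # hypothetical signal: one nonzero coefficient
Q_true = sum(theta.get(k, 0.0) ** 2 for k in S)
reps = 2000
avg = sum(q_hat(simulate(theta, n, S, rng), n, S) for _ in range(reps)) / reps
print(len(S), Q_true, round(avg, 3))
```

The Monte Carlo average should be close to the noiseless sum of squares, while a single draw fluctuates more as the index set grows, which is the bias and variance trade-off governing the choice of the cut-off.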

Thus, we consider a multivariate cut-off $m = (m_1,\dots,m_{d^*}) \in \mathbb{N}^{d^*}$. For a subset
I of cardinality $\ell \le d^*$, we test significance of the vector $\theta_I^j[f]$ by comparing its
estimate $\hat Q_{m_\ell,I}^j$ with a prescribed threshold $\lambda_\ell$. This leads us to define an estimator
of the set J by

$$\hat J_n(m,\lambda) = \Big\{ j \in \{1,\dots,d\} : \max_{\ell \le d^*} \lambda_\ell^{-1} \max_{I \in P_\ell^d} \hat Q_{m_\ell,I}^j \ge 1 \Big\},$$

where $m = (m_1,\dots,m_{d^*}) \in \mathbb{N}^{d^*}$ and $\lambda = (\lambda_1,\dots,\lambda_{d^*}) \in \mathbb{R}^{d^*}$ are two vectors of
tuning parameters. As already mentioned, the role of m is to ensure that the
truncated sums $Q_{m_\ell,I}^j$ do not deviate too much from the complete sums $\|\theta_I^j[f]\|_2^2$.
Quantitatively speaking, for a given t > 0, we would like to choose the $m_\ell$'s so that
$Q_{m_s,J}^j \ge \kappa t/(t+1)$, where s = Card(J). This guarantee can be achieved due to the
smoothness assumption. Indeed, as proved in (26) (cf. Appendix B), it holds that

$$Q_{m_s,J}^j \ge \kappa - m_s^{-2} L s, \qquad \forall j \in J.$$

Therefore, choosing $m_\ell = (\ell L(1+t)/\kappa)^{1/2}$, for every $\ell = 1,\dots,d^*$, entails the
inequality $Q_{m_s,J}^j \ge \kappa t/(t+1)$, which indicates that the relevance of variables is
not affected too much by the truncation.
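For very small dimensions, the selection rule can be run by exhaustive enumeration of the subsets I. The following is only a sketch under stated assumptions: the cut-offs and thresholds are passed in as arbitrary illustrative numbers rather than derived from the formulas of the paper, the signal is a single hypothetical coefficient, and all names are made up for the example. The cost of the enumeration grows combinatorially with d and d*, which is in line with the remark that the estimator is a theoretical device rather than a practical procedure.

```python
import math
import random
from itertools import combinations, product

def index_set(m, size, pos):
    """Indices on the coordinates of I: norm at most m, nonzero entry at position pos."""
    r = int(math.floor(m))
    return [k for k in product(range(-r, r + 1), repeat=size)
            if k[pos] != 0 and sum(c * c for c in k) <= m * m]

def embed(k_small, I, d):
    """Place the entries of k_small at the coordinates listed in I, zeros elsewhere."""
    k = [0] * d
    for c, i in zip(k_small, I):
        k[i] = c
    return tuple(k)

def select_variables(y, n, d, d_star, m, lam):
    """Indices j whose largest normalized statistic, over subsets I containing j
    with Card(I) <= d_star, reaches the level 1."""
    selected = set()
    for j in range(d):
        best = -math.inf
        for ell in range(1, d_star + 1):
            for I in combinations(range(d), ell):
                if j not in I:
                    continue
                S = index_set(m[ell - 1], ell, I.index(j))
                q = sum(y(embed(k, I, d)) ** 2 - 1.0 / n for k in S)
                best = max(best, q / lam[ell - 1])
        if best >= 1.0:
            selected.add(j)
    return selected

def make_observations(theta, n, rng):
    """Lazy sampler of y_k = theta_k + xi_k / sqrt(n); each k is drawn once and cached."""
    cache = {}
    def y(k):
        if k not in cache:
            cache[k] = theta.get(k, 0.0) + rng.gauss(0.0, 1.0) / math.sqrt(n)
        return cache[k]
    return y

d, d_star, n = 4, 2, 10_000
theta = {(1, 0, 0, 0): 1.0}                 # only variable 0 is relevant (assumption)
y = make_observations(theta, n, random.Random(1))
J_hat = select_variables(y, n, d, d_star, m=[1.0, 1.5], lam=[0.1, 0.1])
print(sorted(J_hat))
```

Caching the observations matters here: the same noisy coefficient enters the statistics of several overlapping subsets, exactly as in the group structure described above.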

Pushing further the analogy with hypotheses testing, we define the Type I error
of an estimator $\hat J_n$ of J as that of having $\hat J_n \not\subset J$, i.e., classifying some
irrelevant variables as relevant. The Type II error is then that of having $J \not\subset \hat J_n$,
which amounts to classifying some relevant variables as irrelevant. As in testing
problems, handling the Type I error is easier since the distribution of the test
statistic is independent of f. In fact, this is the max of a finite family of random
variables drawn from translated and scaled $\chi^2$ distributions. Using the Bonferroni
adjustment leads to the following control of the first kind error.

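Proposition 1. Let us denote by $N(\ell,\gamma)$ the cardinality of the set $\{k \in \mathbb{Z}^\ell :
\|k\|_2^2 \le \gamma\ell \text{ and } k_1 \ne 0\}$. If for some A > 1 and for every $\ell = 1,\dots,d^*$,

$$\lambda_\ell \ge \frac{2\sqrt{A\,N(\ell, m_\ell^2/\ell)\,d^*\log(2ed/d^*)} + 2A\,d^*\log(2ed/d^*)}{n}, \qquad (8)$$

then the Type I error $\mathbf{P}(\hat J_n(m,\lambda) \not\subset J)$ is upper-bounded by $(2ed/d^*)^{-d^*(A-1)}$ and
therefore tends to 0 as $d \to +\infty$.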

This proposition shows that the Type I error of a variable selection procedure
may be made small by choosing a sufficiently high threshold. By doing this, we
run the risk of rejecting $H_{0j}$ too rarely and of drastically underestimating the set of
relevant variables. The next result establishes a sufficient condition, which will
be shown to be tight, ensuring that such an underestimation does not occur.

Theorem 1. Let condition [C1(κ, L)] be satisfied with some known constants
κ > 0 and L < ∞ and let s = Card(J). For some real numbers t > 0 and A > 1, set
$m_\ell = (\ell L(1+t)/\kappa)^{1/2}$, $\ell = 1,\dots,d^*$, and define $\lambda_\ell$ to be equal to the right-hand
side of (8). If the condition

$$4\lambda_s \le \kappa t/(1+t) \qquad (9)$$

is fulfilled, then $\hat J_n(m,\lambda)$ is consistent and satisfies the inequalities $\mathbf{P}(J \not\subset \hat J_n(m,\lambda))
\le 2(2ed/d^*)^{-d^*(A-1)}$ and $\mathbf{P}(\hat J_n(m,\lambda) \ne J) \le 3(2ed/d^*)^{-d^*(A-1)}$.

Condition (9) ensuring the consistency of the variable selection procedure $\hat J_n$
admits a very natural interpretation: It is possible to detect relevant variables if
the degree of relevance κ is larger than a multiple of the threshold $\lambda_s$, the latter
being chosen according to the noise level.

A first observation is that this theorem provides interesting insight into the possibility
of consistent recovery of the sparsity pattern J in the context of fixed intrinsic
dimension. In fact, when d* remains bounded from above when $n \to \infty$
and $d \to \infty$, then we get that $\mathbf{P}(\hat J_n(m,\lambda) = J) \to 1$ as $n, d \to \infty$ provided that

$$\log d \le \mathrm{Const}\cdot n. \qquad (10)$$

Although we did not find (exactly) this result in the statistical literature on variable
selection, it can be checked that (10) is a necessary and sufficient condition
for recovering the sparsity pattern J in linear regression with fixed sparsity d*
and growing dimension d and sample size n. Thus, in the regime of fixed or
bounded d*, the sparsity pattern estimation in nonparametric regression is not
more difficult than in the parametric linear regression, as far as only the consistency
of estimation is considered and the precise value of the constant in (10) is
neglected. Furthermore, there is a simple estimator $\tilde J_n$ of J (cf. Eq. (3) in [10]),
which is provably consistent under condition (10). This estimator can be seen as
a procedure of testing hypotheses $H_{0j}$ of form (5) with $p = \infty$ and, therefore, it
does not really exploit the structure of the Fourier coefficients of the regression
function. To some extent, this is the reason why in the regime of growing intrinsic
dimension $d^* \to \infty$, the estimator $\tilde J_n$ proposed by [10] is no longer optimal.

In fact, when $d^* \to \infty$, the term $N(s, m_s^2/s)$ present in (9) tends to infinity as
well. Furthermore, as we show in Section 4, this convergence takes place at an
exponential rate in d*. Therefore, in this asymptotic set-up it is crucial to have
the right order of $N(s, m_s^2/s)$ in the condition that ensures the consistency. As
shown in Section 5, this is the case for condition (9).

Remark 1. An apparent drawback of the estimator $\hat J_n$ is the large dimensionality
of tuning parameters involved in $\hat J_n$. However, Theorem 1 reveals that
for achieving good selection power, it is sufficient to select the 2d*-dimensional
tuning parameter (m, λ) on a one-dimensional curve parameterized by $\vartheta = L(1+t)/\kappa$.
Indeed, once the value of $\vartheta$ is given, Thm. 1 advocates for choosing

$$m_\ell = (\ell\vartheta)^{1/2} \quad\text{and}\quad \lambda_\ell = \frac{2\sqrt{A\,N(\ell,\vartheta)\,d^*\log(2ed/d^*)} + 2A\,d^*\log(2ed/d^*)}{n} \qquad (11)$$

for every $\ell = 1,\dots,d^*$. As discussed in Section 6.1, this property allows us to relax
the requirement that the values L and κ involved in [C1] are known in advance.
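The one-dimensional parameterization of Remark 1 is easy to evaluate numerically for small sparsity levels. The sketch below computes the cut-offs and thresholds from a single scalar, using a brute-force lattice count that is only feasible for small values of the first argument; the numerical values of κ, L, t, A, d, d* and n are arbitrary assumptions chosen for illustration, as are the function names.

```python
import math
from itertools import product

def N(ell, gamma):
    """Brute-force count of k in Z^ell with squared norm <= gamma * ell and k_1 != 0."""
    bound = gamma * ell
    r = math.isqrt(int(math.floor(bound)))
    return sum(1 for k in product(range(-r, r + 1), repeat=ell)
               if k[0] != 0 and sum(c * c for c in k) <= bound)

def tuning(vartheta, A, d, d_star, n):
    """Cut-offs and thresholds on the curve indexed by vartheta, following (11)."""
    log_term = d_star * math.log(2.0 * math.e * d / d_star)
    m = [math.sqrt(ell * vartheta) for ell in range(1, d_star + 1)]
    lam = [(2.0 * math.sqrt(A * N(ell, vartheta) * log_term) + 2.0 * A * log_term) / n
           for ell in range(1, d_star + 1)]
    return m, lam

def condition_9(lam_s, kappa, t):
    """Check 4 * lambda_s <= kappa * t / (1 + t)."""
    return 4.0 * lam_s <= kappa * t / (1.0 + t)

kappa, L, t, A = 1.0, 1.0, 1.0, 2.0     # illustrative constants (assumptions)
vartheta = L * (1.0 + t) / kappa
m, lam = tuning(vartheta, A, d=10, d_star=2, n=1000)
ok = condition_9(lam[1], kappa, t)      # true sparsity s = 2 assumed here
print(m, [round(v, 4) for v in lam], ok)
```

Because the count N grows quickly with its first argument, the thresholds increase with the subset size, and the check of condition (9) is driven by the threshold at the true sparsity level.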

Remark 2. The result of the last theorem is in some sense adaptive w.r.t. the
unknown sparsity. Indeed, while the estimator $\hat J_n$ involves d*, which is merely
a known upper bound on the true sparsity s = Card(J) and may be significantly
larger than s, it is the true sparsity s that appears in condition (9) as the first argument
of the quantity $N(\cdot,\vartheta)$. This point is important given the exponential rate
of divergence of $N(\cdot,\vartheta)$ when its first argument tends to infinity. On the other
hand, if condition (9) is satisfied with $N(d^*,\vartheta)$ instead of $N(\mathrm{Card}(J),\vartheta)$, then the
consistent estimation of J can be achieved by a slightly simpler procedure:



The proof of this statement is similar to that of Thm. 1 and will be omitted. 

4. Counting lattice points in a ball. The aim of the present section is to investigate
the properties of the quantity $N(d^*,\gamma)$ that is involved in the conditions
ensuring the consistency of the proposed procedures. Quite surprisingly, the
asymptotic behavior of $N(d^*,\gamma)$ turns out to be related to the Jacobi $\theta$-function.
To show this, let us introduce some notation. For a positive number $\gamma$, we set

$$\mathcal{C}_1(d^*,\gamma) = \big\{k \in \mathbb{Z}^{d^*} : k_1^2 + \dots + k_{d^*}^2 \le \gamma d^*\big\}, \qquad \mathcal{C}_2(d^*,\gamma) = \big\{k \in \mathcal{C}_1(d^*,\gamma) : k_1 = 0\big\},$$

along with $N_1(d^*,\gamma) = \mathrm{Card}\,\mathcal{C}_1(d^*,\gamma)$ and $N_2(d^*,\gamma) = \mathrm{Card}\,\mathcal{C}_2(d^*,\gamma)$. In simple
words, $N_1(d^*,\gamma)$ is the number of (integer) lattice points lying in the d*-dimensional ball
with radius $(\gamma d^*)^{1/2}$ and centered at the origin, while $N_2(d^*,\gamma)$ is the number of
(integer) lattice points lying in the (d* − 1)-dimensional ball with radius $(\gamma d^*)^{1/2}$
and centered at the origin. With this notation, the quantity $N(\ell,\cdot)$ of Theorem 1
can be written as $N_1(\ell,\cdot) - N_2(\ell,\cdot)$. By volumetric arguments, one can check that
$V(d^*)(\sqrt{\gamma}-1)^{d^*}(d^*)^{d^*/2} \le N_1(d^*,\gamma) \le V(d^*)(\sqrt{\gamma}+1)^{d^*}(d^*)^{d^*/2}$, where $V(d^*) =
\pi^{d^*/2}/\Gamma(1+d^*/2)$ is the volume of the unit ball in $\mathbb{R}^{d^*}$. Furthermore, similar
bounds hold true for $N_2(d^*,\gamma)$ as well. Unfortunately, when $d^* \to \infty$, these inequalities
are not accurate enough to yield non-trivial results in the problem of
variable selection we are dealing with. This is especially true for the results on
impossibility of consistent estimation stated in Section 5.
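In low dimension the two counts and the volumetric bounds can be computed by direct enumeration. The sketch below does this for a ball of radius 3.2 in dimension 3, the configuration mentioned in the caption of Fig. 1; the function names are illustrative, and the lower bound is clipped at zero when the square root of γ is below one, which is a choice made here and not something stated in the text.

```python
import math
from itertools import product

def count_ball(dim, radius_sq):
    """Number of integer points k in Z^dim with k_1^2 + ... + k_dim^2 <= radius_sq."""
    r = math.isqrt(int(math.floor(radius_sq)))
    return sum(1 for k in product(range(-r, r + 1), repeat=dim)
               if sum(c * c for c in k) <= radius_sq)

def unit_ball_volume(dim):
    """Volume of the Euclidean unit ball in R^dim."""
    return math.pi ** (dim / 2.0) / math.gamma(1.0 + dim / 2.0)

def volumetric_bounds(dim, gamma):
    """Lower and upper volumetric bounds for the count in the ball of radius sqrt(gamma * dim)."""
    scale = unit_ball_volume(dim) * dim ** (dim / 2.0)
    root = math.sqrt(gamma)
    return scale * max(root - 1.0, 0.0) ** dim, scale * (root + 1.0) ** dim

d_star, radius = 3, 3.2
radius_sq = radius ** 2
gamma = radius_sq / d_star               # so that the radius equals sqrt(gamma * d_star)
N1 = count_ball(d_star, radius_sq)       # points in the 3-dimensional ball
N2 = count_ball(d_star - 1, radius_sq)   # points with first coordinate equal to zero
lower, upper = volumetric_bounds(d_star, gamma)
print(N1, N2, N1 - N2)   # 147 37 110
print(round(lower, 1), round(upper, 1))
```

Even in this small example the gap between the two volumetric bounds is wide, which gives a feeling for why sharper asymptotics are needed when the dimension grows.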

Fig 1. Lattice points in a ball of radius R = γd* = 3.2 in the three dimensional space (d* = 3).
Red points are those of $\mathcal{C}_1(d^*,\gamma) \setminus \mathcal{C}_2(d^*,\gamma)$ while blue stars are those of $\mathcal{C}_2(d^*,\gamma)$. In this example,
N(d*, γ) = N(3, 1.07) = 110.

In order to determine the asymptotic behavior of $N_1(d^*,\gamma)$ and $N_2(d^*,\gamma)$ when
d* tends to infinity, we will rely on their integral representation through Jacobi's
$\theta$-function. Recall that the latter is given by $h(z) = \sum_{r\in\mathbb{Z}} z^{r^2}$, which is well defined
for any complex number z belonging to the unit ball |z| < 1. To briefly explain
where the relation between $N_1(d^*,\gamma)$ and the $\theta$-function comes from, let
us denote by $\{a_r\}$ the sequence of coefficients of the power series of $h(z)^{d^*}$, that
is $h(z)^{d^*} = \sum_{r\ge 0} a_r z^r$. One easily checks that $\forall r \in \mathbb{N}$, $a_r = \mathrm{Card}\{k \in \mathbb{Z}^{d^*} : k_1^2 + \dots +
k_{d^*}^2 = r\}$. Thus, for every $\gamma$ such that $\gamma d^*$ is integer, we have $N_1(d^*,\gamma) = \sum_{r=0}^{\gamma d^*} a_r$.
As a consequence of Cauchy's theorem, we get:

$$N_1(d^*,\gamma) = \frac{1}{2\pi i}\oint \frac{h(z)^{d^*}}{z^{\gamma d^*}(1-z)}\,\frac{dz}{z}, \qquad N_2(d^*,\gamma) = \frac{1}{2\pi i}\oint \frac{h(z)^{d^*-1}}{z^{\gamma d^*}(1-z)}\,\frac{dz}{z},$$

where the integral is taken over any circle |z| = w with 0 < w < 1. Exploiting
this representation and applying the saddle-point method thoroughly described
in [14], we get the following result.

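Proposition 2. Let $\gamma > 0$ be an integer and let $l_\gamma(z) = \log h(z) - \gamma\log z$.

1. There is a unique solution $z_\gamma$ in (0, 1) to the equation $l_\gamma'(z) = 0$. Furthermore,
the function $\gamma \mapsto z_\gamma$ is increasing and $l_\gamma''(z_\gamma) > 0$.
2. For i = 1, 2, the following equivalences hold true:

$$N_i(d^*,\gamma) = \Big(\frac{h(z_\gamma)}{z_\gamma^{\gamma}}\Big)^{d^*}\, \frac{1 + o(1)}{h(z_\gamma)^{i-1}\, z_\gamma (1 - z_\gamma)\big(2\, l_\gamma''(z_\gamma)\,\pi d^*\big)^{1/2}},$$

as d* tends to infinity.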

Hereafter, it will be useful to note that the second part of Prop. 2 yields

$$\log\big(N_1(d^*,\gamma) - N_2(d^*,\gamma)\big) = d^*\, l_\gamma(z_\gamma) - \frac{1}{2}\log d^* + c_\gamma + o(1), \qquad \text{as } d^* \to \infty, \qquad (12)$$

with $c_\gamma = \log\Big(\frac{h(z_\gamma) - 1}{h(z_\gamma)\, z_\gamma (1 - z_\gamma)\sqrt{2\pi\, l_\gamma''(z_\gamma)}}\Big)$. Furthermore, while the asymptotic equivalences
of Prop. 2 are established for integer values of $\gamma > 0$, relation $\log\big(N_1(d^*,\gamma) -
N_2(d^*,\gamma)\big) = d^*\, l_\gamma(z_\gamma)(1 + o(1))$ holds true for any positive real number $\gamma$ [30]. In
order to get an idea of how the terms $z_\gamma$ and $l_\gamma(z_\gamma)$ depend on $\gamma$, we depicted in
Fig. 2 the plots of these quantities as functions of $\gamma > 0$.
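The saddle-point quantities can be explored numerically. The sketch below truncates the series of the θ-function, finds the root of the derivative of the function appearing in the exponent by bisection, and compares the resulting first-order approximation of the logarithm of the lattice count in the full ball with the exact count, obtained from the coefficients of the power of the θ-series. The truncation length, the bisection bracket, the choice of 40 dimensions with γ = 1 and all function names are assumptions made for illustration; for a fixed dimension the approximation is only expected to hold up to a moderate relative error.

```python
import math

def theta_series(z, terms=60):
    """h(z) = sum over integers r of z^(r^2), with its first two derivatives (truncated)."""
    h, dh, d2h = 1.0, 0.0, 0.0
    for r in range(1, terms):
        e = r * r
        h += 2.0 * z ** e
        dh += 2.0 * e * z ** (e - 1)
        if e >= 2:
            d2h += 2.0 * e * (e - 1) * z ** (e - 2)
    return h, dh, d2h

def l_prime(gamma, z):
    """Derivative of log h(z) - gamma * log z."""
    h, dh, _ = theta_series(z)
    return dh / h - gamma / z

def saddle_point(gamma):
    """Root of l_prime in (0, 1) by bisection; z * l_prime(z) is increasing in z."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if l_prime(gamma, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_N1_saddle(d_star, gamma):
    """First-order saddle-point approximation of log N_1(d_star, gamma)."""
    z = saddle_point(gamma)
    h, dh, d2h = theta_series(z)
    l = math.log(h) - gamma * math.log(z)
    l2 = d2h / h - (dh / h) ** 2 + gamma / z ** 2
    return (d_star * l - math.log(z * (1.0 - z))
            - 0.5 * math.log(2.0 * math.pi * l2 * d_star))

def exact_N1(d_star, M):
    """Exact number of k in Z^d_star with squared norm <= M, from coefficients of h(z)^d_star."""
    base = [0] * (M + 1)
    base[0] = 1
    r = 1
    while r * r <= M:
        base[r * r] = 2
        r += 1
    coeffs = [1] + [0] * M
    for _ in range(d_star):
        new = [0] * (M + 1)
        for i, a in enumerate(coeffs):
            if a:
                for e, b in enumerate(base):
                    if b and i + e <= M:
                        new[i + e] += a * b
        coeffs = new
    return sum(coeffs)

z1 = saddle_point(1)
exact_log = math.log(exact_N1(40, 40))
approx_log = log_N1_saddle(40, 1)
print(round(z1, 4), round(exact_log, 2), round(approx_log, 2))
```

The polynomial-power route to the exact count is the elementary counterpart of the generating-function identity behind the integral representation, and it remains cheap in dimensions where direct enumeration of lattice points is out of reach.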

Fig 2. The plots of mappings $\gamma \mapsto z_\gamma$ and $\gamma \mapsto l_\gamma(z_\gamma)$. One can observe that both functions are increasing,
the first one converges to 1 very rapidly, while the second one seems to diverge very slowly.

Combining relation (12) with Thm. 1, we get the following result.

Corollary 3. Let condition [C1(κ, L)] be satisfied with some known constants
κ > 0 and L < ∞. Consider the asymptotic set-up in which both $d = d_n$ and
$d^* = d^*_n$ tend to infinity as $n \to \infty$. Assume that d grows at a sub-exponential rate
in n, that is $\log\log d = o(\log n)$. If

$$\limsup_{n\to\infty} \frac{d^*}{\log n} < \frac{2}{l_\gamma(z_\gamma)}$$

with $\gamma = L/\kappa$, then consistent estimation of J is possible and can be achieved, for
instance, by the estimator $\hat J_n$.



5. Tightness of the assumptions. In this section, we focus our attention on
the functional class $\Sigma(\kappa, L)$ of all functions satisfying assumption [C1(κ, L)]. For
emphasizing that J is the sparsity pattern of the function f, we write $J_f$ instead
of J. We assume that s = Card(J) = d*. The goal is to provide conditions under
which the consistent estimation of the sparsity support is impossible, that is
there exists a constant c > 0 and an integer $n_0 \in \mathbb{N}$ such that, if $n \ge n_0$,

$$\inf_{\tilde J}\ \sup_{f\in\Sigma(\kappa,L)} \mathbf{P}_f(\tilde J \ne J_f) \ge c,$$

where the inf is over all possible estimators of $J_f$. To this end, we introduce a set
of M + 1 probability distributions $\mu_0,\dots,\mu_M$ on $\Sigma(\kappa, L)$ and use the fact that



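$$\inf_{\tilde J}\ \sup_{f\in\Sigma(\kappa,L)} \mathbf{P}_f(\tilde J \ne J_f) \ \ge\ \inf_{\tilde J}\ \frac{1}{M+1}\sum_{\ell=0}^{M} \int_{\Sigma(\kappa,L)} \mathbf{P}_f(\tilde J \ne J_f)\,\mu_\ell(df). \qquad (13)$$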



These measures $\mu_\ell$ will be chosen in such a way that for each $\ell \ge 1$ there is a set
$J_\ell$ of cardinality d* such that $\mu_\ell(J_f = J_\ell) = 1$ and all the sets $J_1,\dots,J_M$ are distinct.
The measure $\mu_0$ is the Dirac measure in 0. Considering these $\mu_\ell$'s as "priors" on
$\Sigma(\kappa, L)$ and defining the corresponding "posteriors" $P_0, P_1, \dots, P_M$ by

$$P_\ell(A) = \int_{\Sigma(\kappa,L)} \mathbf{P}_f(A)\,\mu_\ell(df), \qquad \text{for every measurable set } A \subset \mathbb{R}^{\mathbb{Z}^d},$$

we can write the inequality (13) as

$$\inf_{\tilde J}\ \sup_{f\in\Sigma(\kappa,L)} \mathbf{P}_f(\tilde J \ne J_f) \ \ge\ \inf_{\psi}\ \frac{1}{M+1}\sum_{\ell=0}^{M} P_\ell(\psi \ne \ell), \qquad (14)$$

where the inf is taken over all random variables $\psi$ taking values in $\{0,\dots,M\}$. The latter
inf will be controlled using a suitable version of the Fano lemma. To state it,
we denote by $\mathcal{K}(P, Q)$ the Kullback-Leibler divergence between two probability
measures P and Q defined on the same probability space.

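Lemma 4 (Corollary 2.6 of [39]). Let $M \ge 3$ be an integer, $(\mathcal{X}, \mathscr{A})$ be a measurable
space and let $P_0, \ldots, P_M$ be probability measures on $(\mathcal{X}, \mathscr{A})$. Let us set
$p_{e,M} = \inf_\psi (M+1)^{-1} \sum_{\ell=0}^{M} P_\ell(\psi \ne \ell)$, where the inf is taken over all measurable
functions $\psi : \mathcal{X} \to \{0, \ldots, M\}$. If for some $0 < \alpha < 1$, $(M+1)^{-1} \sum_{\ell=0}^{M} \mathcal{K}(P_\ell, P_0) \le \alpha \log M$,
then $p_{e,M} \ge \frac{1}{2} - \alpha$.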

We apply this lemma with $\mathcal{X}$ being the set of all arrays $y = \{y_k : k \in \mathbb{Z}^d\}$ such
that for some $K > 0$ the entries $y_k = 0$ for every $k$ larger than $K$ in $\ell_2$-norm. It follows
from Fano's lemma that one can deduce a lower bound on $p_{e,M}$, the quantity
we are interested in, from an upper bound on the average Kullback-Leibler
divergence between $P_\ell$ and $P_0$. With these tools at hand, we are in a position to
state the main result on the impossibility of consistent estimation of the sparsity
pattern in the case when the conditions of Thm. 1 are violated.

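Theorem 2. Assume that $\vartheta = L/\kappa > 1$ and $\binom{d}{d^*} \ge 3$. Let $\gamma_\vartheta$ be the largest integer
satisfying $\gamma\,(1 + (h(z_\gamma) - 1)^{-1}) \le \vartheta$, where the Jacobi $\theta$-function $h$ and $z_\gamma$ are those
defined in Section 4.

i) If for some $\alpha \in (0, 1/2)$,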



N{d*,r,)d*\og{dld*) ■& 




(15) 



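then, for $d^*$ large enough, $\inf_{\tilde J} \sup_{f \in \Sigma(\kappa, L)} P_f(\tilde J \ne J_f) \ge \frac{1}{2} - \alpha$.

ii) If for some $\alpha \in (0, 1/2)$,

$$\frac{d^* \log(d/d^*)}{n} \ge \frac{\kappa}{\alpha}, \qquad (16)$$

then $\inf_{\tilde J} \sup_{f \in \Sigma(\kappa, L)} P_f(\tilde J \ne J_f) \ge \frac{1}{2} - \alpha$.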
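Claim ii) can be turned into a simple numerical rule. The Python sketch below is an illustration under the assumption that condition (16) reads $d^* \log(d/d^*)/n \ge \kappa/\alpha$: it picks the smallest admissible $\alpha$ for given $(n, d, d^*, \kappa)$ and returns the resulting lower bound $1/2 - \alpha$ on the minimax probability of error, or 0 when the claim gives no information. The function name is hypothetical.

```python
import math

def fano_error_lower_bound(n, d, d_star, kappa=1.0):
    """Lower bound 1/2 - alpha from claim ii), with the smallest admissible alpha.

    Assumes condition (16) is d* log(d/d*) / n >= kappa / alpha, so that the
    smallest alpha is kappa * n / (d* log(d/d*)). Returns 0.0 when alpha >= 1/2.
    """
    budget = d_star * math.log(d / d_star)
    alpha = kappa * n / budget
    return max(0.0, 0.5 - alpha)

# With few observations the bound is informative; with many it is vacuous.
print(fano_error_lower_bound(n=20, d=10**6, d_star=10))
print(fano_error_lower_bound(n=1000, d=10**6, d_star=10))

# The proof relies on log C(d, d*) >= d* log(d/d*); a quick spot check:
print(math.log(math.comb(50, 5)) >= 5 * math.log(50 / 5))
```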



It is worth stressing here that condition (15) is the converse of condition (9) of
Thm. 1 in the case $d^* \to \infty$, in the sense that condition (9) amounts to requiring
that the left-hand side of (15) is smaller than some constant. There is however
one difference between the quantities involved in these conditions: the term
$N(d^*, \vartheta(1 + \tau))$ of (9) is replaced by $N(d^*, \gamma_\vartheta)$ in condition (15). One can wonder
how close $\gamma_\vartheta$ is to $\vartheta$. To give a qualitative answer to this question, we plotted
in Figure 3 the curve of the mapping $\vartheta \mapsto \gamma_\vartheta$ along with the bisector $\vartheta \mapsto \vartheta$. We
observe that the difference between the two curves is small compared to $\vartheta$. As we
discuss later, this property shows that the constants involved in the necessary
condition and in the sufficient condition for consistent estimation of $J$ are very
close, especially for large values of $\vartheta$.




Fig 3. The curve of the function $L \mapsto \gamma_L$ (blue) and the bisector (red).



6. Adaptivity and minimax rates of separation. 

6.1. Adaptation with respect to $L$ and $\kappa$. The estimator $\hat J_n(m, \lambda)$ we have
introduced in Section 3 is clearly nonadaptive: the tuning parameters $(m, \lambda)$
recommended by the developed theory involve the values $L$ and $\kappa$, which are
generally unknown. Fortunately, we can take advantage of the fact that the choice of
$m$ and $\lambda$ is governed by the one-dimensional parameter $\vartheta = L(1 + \tau)/\kappa$. Therefore,
it is realistic to assume that a finite grid of values $1 < \vartheta_1 < \ldots < \vartheta_K < \infty$ is
available containing a true value of $\vartheta$. The following result provides an adaptive
procedure of variable selection with guaranteed control of the error.



Proposition 5. Let $1 < \vartheta_1 < \ldots < \vartheta_K < \infty$ and $\tau > 0$ be given values and set²

maxj=i dXfcez 



I 



■ mm 



[i:{l + T)- 



.9. 



'4 



< #,4 < K. 



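For every $i = 1, \ldots, K$, let us denote $\hat J_n(i) = \hat J_n(m(\vartheta_i), \lambda(\vartheta_i))$ with $m_\ell(\vartheta) = (\vartheta \ell)^{1/2}$ and

$$\lambda_\ell(\vartheta) = \frac{2\sqrt{2 N(\ell, \vartheta)\, d^* \log(2ed/d^*)} + 4 d^* \log(2ed/d^*)}{n}.$$

If the condition $4\lambda_s(\vartheta_{i^*}) \le \kappa\tau/(1 + \tau)$ is fulfilled, then the estimator $\hat J_n^{\rm ad} = \bigcup_{i=1}^{K} \hat J_n(i)$
satisfies $P(\hat J_n^{\rm ad} \ne J) \le (K + 2)(d^*/2ed)^{d^*}$.

²We use the convention that the minimum over an empty set equals $+\infty$.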

In simple words, if the grid of possible values $\{\vartheta_i\}$ has a cardinality $K$ which
is not too large (that is, $K(d^*/d)^{d^*} \to 0$), then declaring a variable relevant if
at least one of the procedures $\hat J_n(i)$ suggests its relevance provides a consistent
and adaptive variable selection strategy. The proof of this statement follows
immediately from Prop. 1 and Thm. 1. Indeed, applying Prop. 1 with $A = 2$
yields $P(\hat J_n^{\rm ad} \not\subset J) \le \sum_{i=1}^{K} P(\hat J_n(i) \not\subset J) \le K(d^*/2ed)^{d^*}$, while Thm. 1 ensures that
$P(\hat J_n^{\rm ad} \not\supset J) \le P(\hat J_n(i^*) \not\supset J) \le 2(d^*/2ed)^{d^*}$.
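The adaptive rule is simply a union of the sets selected along the grid. The short Python sketch below illustrates this together with the error bound $(K + 2)(d^*/2ed)^{d^*}$ of Proposition 5; the function names and the toy inputs are hypothetical, and the individual procedures $\hat J_n(i)$ are taken as given.

```python
import math

def adaptive_selection(selections):
    """Union of the variable sets returned by the K candidate procedures."""
    result = set()
    for selected in selections:
        result |= set(selected)
    return result

def error_bound(K, d, d_star):
    """Bound (K + 2) * (d* / (2 e d))^{d*} on the probability of a selection error."""
    return (K + 2) * (d_star / (2.0 * math.e * d)) ** d_star

# Toy example: three grid points, the first procedure selects nothing.
print(adaptive_selection([set(), {1}, {1, 3}]))
print(error_bound(K=5, d=100, d_star=2))
```

A variable is declared relevant as soon as one grid point flags it, which is why the false-positive part of the bound grows linearly in $K$.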



6.2. Minimax rates of separation. Since the methodology of Section 3 takes
its roots in the theory of hypotheses testing, one naturally wonders what are
the minimax rates of separation in the problem of variable selection. The
results stated in the foregoing sections allow us to answer this question in the case of
Sobolev smoothness 1 and alternatives separated in $L^2$-norm. The following
result, the proof of which is postponed to Appendix E, provides minimax rates.
We assume herein that the true sparsity $s = \mathrm{Card}(J)$ and its known upper
estimate $d^*$ are such that $d^*/s$ is bounded from above by some constant.

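Proposition 6. There is a constant $D^*$ depending only on $L$ such that if

$$\kappa \ge D^* \left\{ \left( \frac{s \log(d/s)}{n^2} \right)^{2/(4+s)} \vee \frac{s \log(d/s)}{n} \right\},$$

then there exists a consistent estimator of $J$. Furthermore, the consistency is uniform
in $f \in \Sigma(\kappa, L)$. On the other hand, there is a constant $D_*$ depending only on $L$
such that if

$$\kappa \le D_* \left\{ \left( \frac{s \log(d/s)}{n^2} \right)^{2/(4+s)} \vee \frac{s \log(d/s)}{n} \right\},$$

then uniformly consistent estimation of $J$ is impossible.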

Borrowing the terminology of the theory of hypotheses testing, we say that
$\left( \frac{s \log(d/s)}{n^2} \right)^{2/(4+s)} \vee \frac{s \log(d/s)}{n}$ is the minimax rate of separation in the problem of
variable selection for Sobolev smoothness one. These results readily extend to
Sobolev smoothness of any order $\beta \ge 1$, in which case the rate of separation
takes the form $\left( \frac{s \log(d/s)}{n^2} \right)^{2\beta/(4\beta+s)} \vee \frac{s \log(d/s)}{n}$. The first term in this maximum
coincides, up to the logarithmic term, with the minimax rate of separation in the
problem of detection of an $s$-dimensional signal [22]. Note, however, that in our
case this logarithmic inflation is unavoidable. It is the price to pay for not
knowing in advance which $s$ variables are relevant.
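To see which of the two terms dominates, the rate can be evaluated numerically. The Python sketch below assumes the rate has the form $(s\log(d/s)/n^2)^{2\beta/(4\beta+s)} \vee s\log(d/s)/n$ discussed above; the function name and the sample values are illustrative only.

```python
import math

def separation_rate(s, d, n, beta=1.0):
    """Assumed minimax separation rate: max of a nonparametric and a parametric term."""
    a = s * math.log(d / s)
    nonparametric = (a / n ** 2) ** (2.0 * beta / (4.0 * beta + s))
    parametric = a / n
    return max(nonparametric, parametric)

for n in (10**3, 10**4, 10**5):
    print(n, separation_rate(s=2, d=100, n=n))
```

For fixed $s$ and large $n$ the first (nonparametric) term dominates, since its exponent $4\beta/(4\beta+s)$ in $n$ is smaller than 1.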
7. Nonparametric regression with random design. So far, we have analyzed
the situation in which noisy observations of the regression function $f(\cdot)$
are available at all points $x \in [0, 1]^d$. Let us turn now to the more realistic model
of nonparametric regression, when the observed noisy values of $f$ are sampled
at random in the unit hypercube $[0, 1]^d$. More precisely, we assume that $n$
independent and identically distributed pairs of input-output variables $(X_i, Y_i)$,
$i = 1, \ldots, n$, are observed that obey the regression model

$$Y_i = f(X_i) + \sigma \varepsilon_i, \qquad i = 1, \ldots, n.$$

The input variables $X_1, \ldots, X_n$ are assumed to take values in $\mathbb{R}^d$ while the output
variables $Y_1, \ldots, Y_n$ are scalar. As usual, $\varepsilon_1, \ldots, \varepsilon_n$ are such that $E[\varepsilon_i \mid X_i] = 0$, $i =
1, \ldots, n$; additional conditions will be imposed later. Without requiring $f$ to
be of a special parametric form, we aim at recovering the set $J \subset \{1, \ldots, d\}$ of its
relevant variables. The noise magnitude $\sigma$ is assumed to be known.

It is clear that the estimation of $J$ cannot be accomplished without imposing
some further assumptions on $f$ and on the distribution $P_X$ of the input variables.
Roughly speaking, we will assume that $f$ is differentiable with a square
integrable gradient and that $P_X$ admits a density which is bounded from below.
More precisely, let $g$ denote the density of $P_X$ w.r.t. the Lebesgue measure.

[C2] $g(x) = 0$ for any $x \notin [0, 1]^d$ and $g(x) \ge g_{\min}$ for any $x \in [0, 1]^d$.

The next assumptions imposed on the regression function and on the noise
require their boundedness in an appropriate sense. These assumptions are needed
in order to prove, by means of a concentration inequality, the closeness of the
empirical coefficients to the true ones.

[C3($L_\infty$, $L_2$)] The $L^\infty([0,1]^d, \mathbb{R}, P_X)$ and $L^2([0,1]^d, \mathbb{R}, P_X)$ norms of the function $f$
are bounded from above respectively by $L_\infty$ and $L_2$, i.e., $P(|f(X)| \le L_\infty) = 1$
and $E[f(X)^2] \le L_2^2$.

[C4] The noise variables satisfy a.e. $E[e^{t\varepsilon_i} \mid X_i] \le e^{t^2/2}$ for all $t > 0$.

We stress once again that the primary aim of this work is merely to understand
when it is possible to consistently estimate the sparsity pattern. The estimator
that we will define is intended to show the possibility of consistent estimation,
rather than being a practical procedure for recovering the sparsity pattern.
Therefore, the estimator will be allowed to depend on the parameters $g_{\min}$,
$L$, $\kappa$ and $L_2$ appearing in conditions [C1-C3].

7.1. An estimator of $J$ and its consistency. The estimator of the sparsity
pattern $J$ that we are going to introduce now is based on the following simple
observation: if $j \notin J$ then $\theta_k[f] = 0$ for every $k$ such that $k_j \ne 0$. In contrast, if $j \in J$
then there exists $k \in \mathbb{Z}^d$ with $k_j \ne 0$ such that $|\theta_k[f]| > 0$. To turn this observation
into an estimator of $J$, we start by estimating the Fourier coefficients $\theta_k[f]$
by their empirical counterparts:

$$\hat\theta_k = \frac{1}{n} \sum_{i=1}^{n} \frac{\varphi_k(X_i)}{g(X_i)}\, Y_i, \qquad k \in \mathbb{Z}^d.$$

Then, for every $\ell \in \mathbb{N}$ and for any $m > 0$, we introduce the notation $S^j_{m,\ell} = \{k \in
\mathbb{Z}^d : \|k\|_2 \le m,\ \|k\|_0 \le \ell,\ k_j \ne 0\}$. The estimator of $J$ is defined by

$$\hat J_n^{(1)}(m, \lambda) = \Big\{ j \in \{1, \ldots, d\} : \max_{k \in S^j_{m,d^*}} |\hat\theta_k| > \lambda \Big\}, \qquad (17)$$

where $m$ and $\lambda$ are some parameters to be defined later. The next result, the
proof of which is placed in the supplementary material, provides consistency
guarantees for $\hat J_n^{(1)}(m, \lambda)$.
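The thresholding rule (17) is easy to try on simulated data. The Python sketch below is an illustration under several assumptions that are not taken from the paper: the design is uniform on $[0,1]^d$ (so $g \equiv 1$), the basis is the real trigonometric system ($\sqrt{2}\cos(2\pi k \cdot x)$ when the first nonzero coordinate of $k$ is positive, $\sqrt{2}\sin(2\pi k \cdot x)$ otherwise), and the values of $m$, $\ell$ and $\lambda$ are picked by hand rather than by the theory.

```python
import math
import random
from itertools import combinations, product

def phi(k, x):
    """Trigonometric basis function for a sparse multi-index k = {coordinate: integer}."""
    if not k:
        return 1.0
    angle = 2.0 * math.pi * sum(v * x[j] for j, v in k.items())
    first = k[min(k)]                       # value of the first nonzero coordinate
    return math.sqrt(2.0) * (math.cos(angle) if first > 0 else math.sin(angle))

def frequencies(d, m, ell):
    """All nonzero k in Z^d with ||k||_2 <= m and ||k||_0 <= ell, as sparse dicts."""
    r = int(m)
    vals = [v for v in range(-r, r + 1) if v != 0]
    for size in range(1, ell + 1):
        for support in combinations(range(d), size):
            for values in product(vals, repeat=size):
                if sum(v * v for v in values) <= m * m:
                    yield dict(zip(support, values))

def select_variables(X, Y, m, ell, lam, density=lambda x: 1.0):
    """Select every j for which some |theta_hat_k| with k_j != 0 exceeds lam."""
    n, d = len(X), len(X[0])
    weights = [y / density(x) for x, y in zip(X, Y)]
    selected = set()
    for k in frequencies(d, m, ell):
        theta_hat = sum(phi(k, x) * w for x, w in zip(X, weights)) / n
        if abs(theta_hat) > lam:
            selected.update(k)              # all coordinates in the support of k
    return selected

# Simulated example: only coordinate 1 (0-based) is relevant.
rng = random.Random(0)
d, n, sigma = 4, 2000, 0.5
X = [[rng.random() for _ in range(d)] for _ in range(n)]
Y = [2.0 * math.cos(2.0 * math.pi * x[1]) + sigma * rng.gauss(0.0, 1.0) for x in X]
selected = select_variables(X, Y, m=1.5, ell=2, lam=0.5)
print(selected)
```

The enumeration in `frequencies` is exponential in `ell`, which mirrors the remark that the procedure is meant to establish feasibility rather than to be computationally efficient.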

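Theorem 3. Let conditions [C1-C4] be fulfilled with some known values $g_{\min}$,
$\vartheta = 2L/\kappa$ and $L_2$. Assume furthermore that the design density $g$ and an upper
estimate on the noise magnitude $\sigma$ are available. Set $m = (\vartheta d^*)^{1/2}$ and $\lambda = 4(\sigma +
L_2)\big(d^* \log(24\sqrt{\vartheta}\, d/d^*)/(n g_{\min}^2)\big)^{1/2}$. If the following conditions are satisfied:

$$\frac{d^* \log(24\sqrt{\vartheta}\, d/d^*)}{n} \le \frac{L_2^2}{L_\infty^2}, \qquad \frac{128(\sigma + L_2)^2\, d^* N(d^*, \vartheta) \log(24\sqrt{\vartheta}\, d/d^*)}{n\, g_{\min}^2} \le \kappa, \qquad (18)$$

then the estimator $\hat J_n^{(1)}(m, \lambda)$ satisfies $P(\hat J_n^{(1)}(m, \lambda) \ne J) \le (8d/d^*)^{-d^*}$.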



If we take a look at the conditions of Theorem 3 ensuring the consistency of
$\hat J_n^{(1)}$, it becomes clear that the strongest requirement is the second inequality in
(18). Roughly speaking, this condition requires that $d^* N(d^*, \vartheta) \log(d/d^*)/n$ is
bounded from above by some constant. According to results stated in Section 4,
$N(d^*, \vartheta)$ diverges exponentially fast in $d^*$, making inequality (18) impossible for $d^*$
larger than $\log n$ up to a multiplicative constant.

It is also worth stressing that although we require the $P_X$-a.e. boundedness of
$f$ by some constant $L_\infty$, this constant is not needed for computing the estimator
proposed in Thm. 3. Only constants related to some quadratic functionals
of the sequence of Fourier coefficients $\theta_k[f]$ are involved in the tuning parameters
$m$ and $\lambda$. This point might be important for designing practical estimators
of $J$, since the estimation of quadratic functionals is more realistic, see for
instance [27, 9], than the estimation of the sup-norm.

Theorem 3 can be reformulated to characterize the level of relevance $\kappa$ for
the relevant components of $X$ making their identification possible. In fact, an
alternative way of stating Theorem 3 is the following: under conditions [C1-C4],
if $\vartheta$ is an arbitrary tuning parameter satisfying the first inequality in (18),
then the estimator $\hat J_n^{(1)}(m, \lambda)$, with $m$ and $\lambda$ chosen as in Theorem 3, satisfies
$P(\hat J_n^{(1)}(m, \lambda) \ne J) \le (8d/d^*)^{-d^*}$ if the smallest level of relevance $\kappa$ for components
$X_j$ of $X$ with $j \in J$ is not smaller than $8\lambda^2 N(d^*, m^2/d^*)$. This statement can be
easily deduced from the proof of Theorem 3 (cf. supplementary material).

7.2. Tightness of the assumptions. A natural question is now to check that
the assumptions of Thm. 3 are tight in the asymptotic regimes of fixed sparsity
and increasing ambient dimension, as well as increasing sparsity. We will only
establish an analogue of claim ii) of Thm. 2. An attempt to prove a result similar
to claim i) of Thm. 2 was made in [11, Theorem 2]. However, the result of [11]
involves a stringent assumption on the empirical Gram matrix (cf. condition (6)
in [11]) and, unfortunately, we are unable to prove the existence of a sampling
scheme for which this assumption is fulfilled.

We assume that the errors $\varepsilon_i$ are i.i.d. standard Gaussian and we focus our
attention on the functional class $\Sigma(\kappa, L)$. The following simple result shows that
the conditions of Thm. 3 are tight in the case of fixed intrinsic dimension.

Proposition 7. Let the design $X_1, \ldots, X_n \in [0, 1]^d$ be either deterministic or
random. If for some positive $\alpha < 1/2$, the inequality

$$\frac{d^* \log(d/d^*)}{n} \ge \kappa \alpha^{-1}$$

holds true, then there is a constant $c > 0$ such that $\inf_{\tilde J_n} \sup_{f \in \Sigma(\kappa, L)} P_f(\tilde J_n \ne J_f) \ge c$.

8. Concluding remarks. The results proved in previous sections almost
exhaustively answer the questions on the existence of consistent estimators of the
sparsity pattern in the model of Gaussian white noise and, to a smaller extent,
in nonparametric regression. In fact, as far as only rates of convergence are of
interest, the result obtained in Thm. 1 is shown in Section 5 to be unimprovable.
Thus only the problem of finding sharp constants remains open. To make these
statements more precise, let us consider the simplified set-up $\sigma = \kappa = 1$ and
define the following two regimes:

• The regime of fixed sparsity, i.e., when the sample size $n$ and the ambient
dimension $d$ tend to infinity but the intrinsic dimension $d^*$ remains
constant or bounded.

• The regime of increasing sparsity, i.e., when the intrinsic dimension $d^*$
tends to infinity along with the sample size $n$ and the ambient dimension
$d$. For simplicity, we will assume that $d^* = O(d^{1-\epsilon})$ for some $\epsilon > 0$.

In the fixed sparsity regime, in view of Theorems 1 and 3, consistent estimation
of the sparsity pattern can be achieved both in the Gaussian white noise model and
in nonparametric regression as soon as $\limsup_{n \to \infty} (d^* \log d)/n < c_*$, where $c_*$ is
the constant defined by $c_* = 1/8$ for the Gaussian white noise model and



for the regression model. On the other hand, by Thm. 2 and Prop. 7, consistent
estimation of the sparsity pattern is impossible if $\liminf_{n \to \infty} (d^* \log d)/n > c^*$
with $c^* = 2$. Thus, up to multiplicative constants $c_*$ and $c^*$ (which are clearly
not sharp), the results of Theorems 1 and 3 cannot be improved in the regime of
fixed sparsity.

In the regime of increasing sparsity, the results we get in the model of Gaussian
white noise are much stronger than those for nonparametric regression. In
the former model, taking the logarithm of both sides of inequality (9) and using
formula (12) for $N(d^*, \cdot) = N_1(d^*, \cdot) - N_2(d^*, \cdot)$, we see that consistent estimation
of $J$ is possible when, for some $\tau > 0$ and for all $n$, the following two conditions
are fulfilled:

$$\begin{cases} l_{L+\tau}(z_{L+\tau})\, d^* + \tfrac{1}{2} \log d^* + \log\log(d/d^*) - 2\log n \le c_1, \\ \log d^* + \log\log(d/d^*) - \log n \le c_1', \end{cases} \qquad (19)$$

with some constants $c_1 = c_1(L, \tau)$ and $c_1' = c_1'(L, \tau)$. On the other hand, Thm. 2
yields that there are some constants $\bar c_1$ and $\bar c_1'$ such that it is impossible to
consistently estimate $J$ if either one of the conditions

$$l_{\gamma_L}(z_{\gamma_L})\, d^* + \tfrac{1}{2} \log d^* + \log\log(d/d^*) - 2\log n \ge \bar c_1, \qquad (20)$$

$$\log d^* + \log\log(d/d^*) - \log n \ge \bar c_1', \qquad (21)$$

is satisfied. First note that the left-hand side of the second condition in (19) is
exactly the same as the left-hand side of (21). If we compare now the left-hand
side of the first condition in (19) with the left-hand side of (20), we see that only
the coefficients of $d^*$ differ. To measure the degree of difference of these two
coefficients we draw in Figure 4 the plots of the functions $L \mapsto l_L(z_L)$ and $L \mapsto l_{\gamma_L}(z_{\gamma_L})$,
with $\gamma_L$ as in Thm. 2. One can observe that the two curves are very close,
especially for relatively large values of $L$. This implies that the conditions (19) are
tight. A simple consequence of inequalities (19) and (20) is that the consistent
recovery of the sparsity pattern is possible under the condition $d^*/\log n \to 0$ and
impossible for $d^*/\log n \to \infty$ as $n \to \infty$, provided that $\log\log(d/d^*) = o(\log n)$.

Fig 4. The curves of the functions $L \mapsto l_L(z_L)$ (blue curve) and $L \mapsto l_{\gamma_L}(z_{\gamma_L})$ (red curve).

Still in the regime of increasing sparsity, but for nonparametric regression, we
proved that consistent estimation of the sparsity pattern is possible whenever

$$\begin{cases} l_{L+\tau}(z_{L+\tau})\, d^* + \tfrac{1}{2} \log d^* + \log\log(d/d^*) - \log n \le c_2, \\ \log d^* + \log\log d - \log n \le c_2', \end{cases} \qquad (22)$$

with some constants $c_2 = c_2(g_{\min}, \sigma, L_2, L)$ and $c_2' = 2\log(L_2/L_\infty)$. As we have
already mentioned, the second condition in (22) is tight, up to the choice of $c_2'$,
in view of Proposition 7. It is natural to expect that the first condition is tight as
well, since it is in the model of Gaussian white noise, which has the reputation
of being simpler than the model of nonparametric regression. However, we do
not have a mathematical proof of this statement.

Let us stress now that, throughout this work, we have deliberately avoided any
discussion of the computational aspects of variable selection in nonparametric
regression. The goal in this paper was to investigate the possibility of consistent
recovery without paying attention to the complexity of the selection procedure.
This led to some conditions that could be considered a benchmark for assessing
the properties of sparsity pattern estimators. As for the estimators proposed
in Section 3, it is worth noting that their computational complexity is not always
prohibitively large. A recommended strategy is to compute the coefficients $\hat\theta_k$
in a stepwise manner; at each step $K = 1, 2, \ldots, d^*$ only the coefficients $\hat\theta_k$ with
$\|k\|_0 = K$ need to be computed and compared with the threshold. If some $\hat\theta_k$
exceeds the threshold, then all the variables $X_i$ corresponding to nonzero
coordinates of $k$ are considered as relevant. We can stop this computation as soon
as the number of variables classified as relevant attains $d^*$. While the worst-case
complexity of this procedure is exponential, there are many functions $f$ for
which the complexity of the procedure will be polynomial in $d$. For example,
this is the case for additive models in which $f(x) = f_1(x_{i_1}) + \ldots + f_{d^*}(x_{i_{d^*}})$ for some
univariate functions $f_1, \ldots, f_{d^*}$.
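The stepwise strategy just described can be sketched as follows. This Python fragment is only an illustration: the callback `coef`, which returns the empirical coefficient of a sparse multi-index, and all parameter names are hypothetical, and the search stops after the first sparsity level at which $d^*$ variables have been flagged.

```python
from itertools import combinations, product

def stepwise_select(coef, d, d_star, m, lam):
    """Scan multi-indices by increasing sparsity level; stop once d_star variables are found.

    coef(k) returns the empirical coefficient for a sparse multi-index
    k = {coordinate: nonzero integer}; only k with ||k||_2 <= m are visited.
    """
    relevant = set()
    r = int(m)                                   # largest admissible |k_j|
    values_1d = [v for v in range(-r, r + 1) if v != 0]
    for level in range(1, d_star + 1):           # level = ||k||_0
        for support in combinations(range(d), level):
            for values in product(values_1d, repeat=level):
                if sum(v * v for v in values) > m * m:
                    continue
                k = dict(zip(support, values))
                if abs(coef(k)) > lam:
                    relevant.update(support)
        if len(relevant) >= d_star:
            break                                # higher sparsity levels are skipped
    return relevant

# Additive toy model: two univariate components are detected at level 1.
calls = []
def oracle(k):
    calls.append(k)
    return 1.0 if k in ({2: 1}, {5: -1}) else 0.0

found = stepwise_select(oracle, d=8, d_star=2, m=2.0, lam=0.5)
print(found, len(calls))
```

In the additive example only the $O(d)$ one-dimensional frequencies are ever visited, which is the polynomial-complexity situation mentioned above.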
Note also that in the present study we focused exclusively on the consistency
of variable selection without paying any attention to the consistency of regression
function estimation. A thorough analysis of the latter problem being left to
future work, let us simply remark that in the case of fixed $d^*$, under the
conditions of Thm. 3, it is straightforward to construct a consistent estimator of the
regression function. In fact, it suffices to use a projection estimator with a properly
chosen truncation parameter on the set of relevant variables. The situation
is much more delicate in the case when the sparsity $d^*$ grows to infinity along
with the sample size $n$. Presumably, condition (19) is no longer sufficient for
consistently estimating the regression function. The rationale behind this
conjecture is that the minimax rate of convergence for estimating $f$ in our context,
if we assume in addition that the set of relevant variables is known, is equal to
$n^{-2/(2+d^*)} = \exp(-2\log n/(2 + d^*))$. If the left-hand side of (19) is equal to a
constant and $\log\log d = o(\log n)$, then the aforementioned minimax rate does not
tend to zero, thus making the estimator inconsistent.

Finally, we would like to mention that the selection of relevant variables is
a challenging statistical task, which might be useful to perform independently
of the task of regression function estimation. Indeed, if we succeed in identifying
relevant variables on a data-set having a small sample size, we can continue
the data collection process more efficiently by recording only the values of relevant
variables. This may considerably reduce the memory costs related to the
data storage and the financial costs necessary for collecting new data. Then, the
regression function may be estimated more accurately on the basis of this new
(larger) data-set.

APPENDIX A: PROOF OF PROPOSITION 1 

To ease notation, we write /„ instead of /„(m, A). It is clear that /„ / if and 
only if 3; e such that maxi<rf* XJ^ raax^^pd ^ > 1, where Q'^ j = Y.,^^gj ^ B^- 

For every; e { 1, . . . , ti}, let us set i?^ ^ = Hkes' ^^1 ~ ^^'^ 
<.7 = (Qi.7)-^/'i:fces^„/fc?/tsothat 



For 7 e J*^, the first two terms of the last sum vanish and, therefore, we have 

{/„^/}-U U UK/^"M=U U U {<7^"4 

jere<d*iepf e<d* lePf jej'ni 



where the last equality results from the fact that 7?^ ^ = if j ^ /. The random 
variable $R^I_{m_\ell,j}$, being a centered sum of squares of independent standard Gaussian
random variables, follows a translated $\chi^2$ distribution. The tails of this distribution
can be evaluated using the following result.

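Lemma 8 (cf. Lemma 1 in [27]). Let $\xi_1, \ldots, \xi_D$ be independent standard Gaussian
random variables. For every $x > 0$ and for every vector $a = (a_1, \ldots, a_D) \in \mathbb{R}_+^D$,
the following inequalities hold true:

$$P\Big(\sum_{i=1}^{D} a_i(\xi_i^2 - 1) \ge 2\|a\|_2 \sqrt{x} + 2\|a\|_\infty x\Big) \le \exp(-x),$$
$$P\Big(\sum_{i=1}^{D} a_i(\xi_i^2 - 1) \le -2\|a\|_2 \sqrt{x}\Big) \le \exp(-x).$$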

We apply this lemma to $R^I_{m_\ell,j}$, for which $\|a\|_\infty = 1$ and $\|a\|_2^2 = N(\ell, m_\ell^2/\ell)$.
Setting $n\lambda_\ell = 2\sqrt{N(\ell, m_\ell^2/\ell)\, x} + 2x$ and using the union bound, we get

d* 

< V £ Card(P/) max pf^^ > nX^ < e'"" V 

T~t IePf;ieI ^ '' J f— ' 



=1 -'^^ £=1 

One checks that $\sum_{\ell=1}^{d^*} \binom{d}{\ell} \le (2ed/d^*)^{d^*}$ holds true for every pair of integers
$(d^*, d)$ such that $1 \le d^* \le d$ (cf. supplementary material for a proof). Hence, for
$x = A d^* \log(2ed/d^*)$, we get $P(\hat J_n \not\subset J) \le (2ed/d^*)^{-(A-1)d^*}$.
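The deviation bounds of Lemma 8, which drive the choice of the thresholds in this proof, can be sanity-checked by simulation. The Python sketch below computes the two deviation levels of the lemma for a nonnegative weight vector $a$ and estimates the corresponding tail frequencies by Monte Carlo; the function names, the number of draws and the seed are arbitrary illustrative choices.

```python
import math
import random

def chi2_tail_thresholds(a, x):
    """Deviation levels of Lemma 8 for weights a (nonnegative) and level x > 0."""
    norm2 = math.sqrt(sum(v * v for v in a))
    norm_inf = max(abs(v) for v in a)
    upper = 2.0 * norm2 * math.sqrt(x) + 2.0 * norm_inf * x
    lower = -2.0 * norm2 * math.sqrt(x)
    return upper, lower

def empirical_tails(a, x, draws=20000, seed=0):
    """Monte Carlo frequencies of the two tail events of Lemma 8."""
    rng = random.Random(seed)
    upper, lower = chi2_tail_thresholds(a, x)
    above = below = 0
    for _ in range(draws):
        s = sum(v * (rng.gauss(0.0, 1.0) ** 2 - 1.0) for v in a)
        above += s >= upper
        below += s <= lower
    return above / draws, below / draws

up, low = empirical_tails([1.0] * 10, x=1.0)
print(up, low, math.exp(-1.0))   # both frequencies should stay below exp(-x)
```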

APPENDIX B: PROOF OF THEOREM 1 
We begin with proving a stronger result that implies the claim of Thm. 1. 

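Proposition 9. Let $\alpha$ be a real number from $(0, 1)$. If for every $j \in J$ and for
$s = \mathrm{Card}(J)$ the inequality

$$Q^J_{m_s,j} \ge \left\{ \left[ \lambda_s + \frac{2\sqrt{N(s, m_s^2/s) \log(2s/\alpha)} + 1}{n} \right]^{1/2} + \left[ \frac{2\log(2s/\alpha)}{n} \right]^{1/2} \right\}^2 \qquad (24)$$

holds true, then $P(J \not\subset \hat J_n) \le \alpha$.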

Proof. To bound from above the probability of Type II error, we rely on the
equivalence: $J \not\subset \hat J_n$ if and only if $\exists j \in J$ such that $\max_{\ell \le d^*} \lambda_\ell^{-1} \max_{I \in P_j^\ell} \hat Q^I_{m_\ell,j} < 1$.
Recall that $s = \mathrm{Card}(J)$. Using Bonferroni's inequality, we get
P(/^/.)<i:,.,,P(rnaxA-m^Qi,<l) 

^Y.jeAQ^n..j<^s)<smaxP[^^^j<Xs). (25) 
By virtue of decomposition (23), 

^{qLj ^^s)=p [{/oiZ + -^<jf + ^ [<j - ^ • 

One checks that -R^^ j - {N^^^ jY + N[s, m^/s) is a drawn from distribution 
with N{s, m^/s]—l degrees of freedom. Therefore, using Lemma 8 stated in pre- 
vious section, we get P(^(<^ j - + ^ < -2^W(5, /s)log(2s/a)) < 
j^. Therefore, P(Q^^ j < A^) is upper-bounded by 

Using the condition of the proposition, we get P(Q^^ j < ^s) < ^ +P(^ms / - 
-y'21og(25/a)) < J. Combining this inequality with (25), we get the result of 
Proposition 9. □ 

To deduce the claim of Thm. 1 from that of Prop. 9, we use the following lower
bound:

$$Q^J_{m_s,j} = Q_j - \sum_{j \in \mathrm{supp}(k) \subset J} \theta_k^2 \mathbf{1}_{\{\|k\|_2 > m_s\}} \ge \kappa - \sum_{j \in \mathrm{supp}(k) \subset J} \theta_k^2 \mathbf{1}_{\{\|k\|_2 > m_s\}}$$
$$\ge \kappa - m_s^{-2} \sum_{j \in \mathrm{supp}(k) \subset J} \theta_k^2 \|k\|_2^2 \ge \kappa - m_s^{-2} L s, \qquad (26)$$

for every $j \in J$. Our choice of $m_s = \sqrt{sL(1 + \tau)/\kappa}$ ensures that $Q^J_{m_s,j} \ge
\kappa\tau/(1 + \tau)$. Finally, using a very rough bound (which is sufficient for our
purposes), the right-hand side in (24) can be upper-bounded by $4\lambda_s$ if $\alpha$ is chosen
to be equal to $2(2ed/d^*)^{-(A-1)d^*}$. Therefore, if $\kappa\tau/(1 + \tau) \ge 4\lambda_s$, then (24) holds true
with $\alpha = 2(2ed/d^*)^{-(A-1)d^*}$ and, therefore, the type II error has a probability less
than or equal to $2(2ed/d^*)^{-(A-1)d^*}$.

APPENDIX C: PROOF OF PROPOSITION 2 

Proof of the first assertion. This proof can be found in [30]; we repeat here the
arguments therein for the sake of keeping the paper self-contained. Recall that
$N_1(d^*, \gamma)$ admits an integral representation with the integrand:

$$\frac{h(z)^{d^*}}{z^{\gamma d^*}} \cdot \frac{1}{z(1 - z)} = \frac{1}{z(1 - z)} \exp\Big[ d^* \log\Big( \frac{h(z)}{z^\gamma} \Big) \Big].$$

For any $y > 0$, we define $\phi(y) = e^{-y} h'(e^{-y})/h(e^{-y}) = \sum_{k \in \mathbb{Z}} k^2 e^{-yk^2} \big/ \sum_{k \in \mathbb{Z}} e^{-yk^2}$
in such a way that

$$l_\gamma'(e^{-y}) = \frac{h'(e^{-y})}{h(e^{-y})} - \frac{\gamma}{e^{-y}} = \frac{\phi(y) - \gamma}{e^{-y}}.$$

By virtue of the Cauchy-Schwarz inequality, it holds that $\sum_k k^4 e^{-yk^2} \sum_k e^{-yk^2} >
(\sum_k k^2 e^{-yk^2})^2$, $\forall y \in (0, \infty)$, implying that $\phi'(y) < 0$ for all $y \in (0, \infty)$, i.e., $\phi$ is
strictly decreasing. Furthermore, $\phi$ is obviously continuous with $\lim_{y \to 0} \phi(y) =
+\infty$ and $\lim_{y \to \infty} \phi(y) = 0$. These properties imply the existence and the uniqueness
of $y_\gamma \in (0, \infty)$ such that $\phi(y_\gamma) = \gamma$. Furthermore, as the inverse of a decreasing
function, the function $\gamma \mapsto y_\gamma$ is decreasing as well. We set $z_\gamma = e^{-y_\gamma}$ so that
$\gamma \mapsto z_\gamma$ is increasing. We also have

$$l_\gamma''(z_\gamma) = -z_\gamma^{-2}\, \phi'(y_\gamma) > 0.$$



Proof of the second assertion. We apply the saddle-point method to the integral
representing $N_1$; see, e.g., Chapter IX in [14]. It holds that

$$N_1(d^*, \gamma) = \frac{1}{2\pi i} \oint \frac{h(z)^{d^*}}{z^{\gamma d^*}} \frac{dz}{z(1 - z)} = \frac{1}{2\pi i} \oint \{z(1 - z)\}^{-1} e^{d^* l_\gamma(z)}\, dz. \qquad (27)$$

The first assertion of the proposition provided us with a real number $z_\gamma$ such
that $l_\gamma'(z_\gamma) = 0$ and $l_\gamma''(z_\gamma) > 0$. The tangent to the steepest descent curve at $z_\gamma$ is
vertical. The path we choose for integration is the circle with center 0 and radius
$z_\gamma$. As this circle and the steepest descent curve have the same tangent at $z_\gamma$,
applying formula (1.8.1) of [14] (with $\alpha = 0$ since $l_\gamma''(z_\gamma)$ is real and positive), we
get that



/ 271 

{z[l-z)]-'e^*^ri-Uz^J——e'-/Hz,{l-z,)]-'e''''^^^^^^ 
|z|=z, \l d*\';{z,] 

when $d^* \to \infty$, as soon as the condition³ $\Re[l_\gamma(z) - l_\gamma(z_\gamma)] \le -\mu$ is satisfied for
some $\mu > 0$ and for any $z$ belonging to the circle $|z| = |z_\gamma|$ and lying not too close
to $z_\gamma$. To check that this is indeed the case, we remark that $\Re[l_\gamma(z)] = \log\big|h(z)/z^\gamma\big|$.

³$\Re u$ stands for the real part of the complex number $u$.
Hence, if $z = z_\gamma e^{i\omega}$ with $\omega \in [\omega_0, 2\pi - \omega_0]$ for some $\omega_0 \in\, ]0, \pi[$, then



h(z) 



zr 



l + 2z + 2Xk>i^''''\ ^ Il + ^l + ^r + 2i:fc>i< 



Y — J 



\l + e''^«ZY\ + ZY + 2Xk>iz'^' 
~z^ 



f l+2z +2V A 

Therefore 3l[l,(z) - 3ll,(z,)] < -p with /i = log ( ii^,^,..;,^,^;;^;^^,.. ] > 0- This 

completes the proof for the term N\{d*,Y). The term N2{d*,Y) can be dealt in 
the same way. 

APPENDIX D: PROOF OF THEOREM 2 

To prove i) we apply Lemma 4 with $M = \binom{d}{d^*}$ in conjunction with a standard
result the proof of which can be found in [11] and in the supplementary material.

Lemma 10. Let $S$ be a subset of $\mathbb{Z}^d$ of cardinality $|S|$ and $A$ be a constant.
Define $\mu_S$ as a discrete measure supported on the finite set of functions $\{f_\omega =
\sum_{k \in S} A\, \omega_k \varphi_k : \omega \in \{\pm 1\}^S\}$ such that $\mu_S(f = f_\omega) = 2^{-|S|}$ for every $\omega \in \{\pm 1\}^S$. If we
define the probability measure $P_S$ by $P_S(B) = \int P_f(B)\, \mu_S(df)$, for every measurable
set $B \subset \mathcal{X}$, and $P_0 = P_{f_0}$, then $\mathcal{K}(P_S, P_0) \le |S| A^4 n^2$.

Without loss of generality, we can assume $\kappa = 1$ (the general case can be
reduced to this one by replacing $L$ and $n$ respectively by $L/\kappa$ and $n\kappa$). Thus,
$\vartheta = L$. We denote the set $\Sigma(1, L)$ by $\Sigma_L$, and choose $\mu_0, \ldots, \mu_M$ as follows: $\mu_0$ is
the Dirac measure $\delta_0$, $\mu_1$ is defined as in Lemma 10 with $S = \mathcal{C}_1(d^*, \gamma_L)$ and
$A = [N(d^*, \gamma_L)]^{-1/2}$. The measures $\mu_2, \ldots, \mu_M$ are defined similarly and
correspond to the $M - 1$ remaining sparsity patterns of cardinality $d^*$.

In view of inequality (14) and Lemma 4, it suffices to show that the measures 
P( satisfy pe{Y.L]=lsLndY,tLo-y^OPe,'Po]<{M+l]alogM. Combining Lemma 10 
with Card(S) = Ni[d*,YL) and inequality (15), we get Jr(P^Po) < "^^rf'n)^^^ " 
rLN[d*rL) — '^l^S^- Now, let us show that pi{T.i) = 1. By symmetry, this will 
imply that pei^i] = 1 for every £. Since pi is supported by the set [fo) '■ <>> e 
it is clear that Xfc,/o^'fc[f'-] =A^\Ni{d*,YL)-N2{d*,YL)\ = 1 and, 

Y,k]Bl[U= ^i^'^^Z Z T^jA^<A^rLNi[d*,YL) 

keZ'^ ke'eiid'.ri) j =1 ke'ei(d* ,r l) 

^ Niid*,YL) . ^ 
The results stated in Section 4 imply that $N_1(d^*, \gamma_L)/N(d^*, \gamma_L) \sim_{d^* \to \infty} 1 + (h(z_{\gamma_L}) - 1)^{-1}$.
Our choice of $\gamma_L$ ensures that, for $d^*$ large enough, $f_\omega \in \Sigma_L$. This completes
the proof of claim i). To prove ii), we still use Lemma 4 with $\mu_0 = \delta_0$ and $\mu_\ell = \delta_{f_\ell}$,
where for every $\ell \in \{1, \ldots, M\}$, $f_\ell$ is chosen as follows. Let $I_1, \ldots, I_M$ be all the
subsets of $\{1, \ldots, d\}$ containing exactly $d^*$ elements. We define $f_\ell$, for $\ell \ne 0$, by its
Fourier coefficients $\{\theta_k^\ell : k \in \mathbb{Z}^d\}$ as follows:

$$\theta_k^\ell = \begin{cases} 1, & k = (k_1, \ldots, k_d) = (\mathbf{1}_{1 \in I_\ell}, \ldots, \mathbf{1}_{d \in I_\ell}), \\ 0, & \text{otherwise.} \end{cases}$$

Obviously, all the functions $f_\ell$ belong to $\Sigma$ and, moreover, each $f_\ell$ has $I_\ell$ as sparsity
pattern. One easily checks that our choice of $f_\ell$ implies $\mathcal{K}(P_{f_\ell}, P_{f_0}) = n\|f_\ell -
f_0\|_2^2 = n$. Therefore, if $\alpha \log M = \alpha \log \binom{d}{d^*} \ge n$, the desired inequality is satisfied.
To conclude it suffices to note that $\log \binom{d}{d^*} \ge d^* \log(d/d^*)$.

APPENDIX E: PROOF OF PROPOSITION 6 

In view of Thm. 1, applied with A = 2 and t = 1, the consistent (uniformly in 

T e T.[k,L)) estimation of / is possible if — — — ^ — ^lj — < |, 

Since d*/s is upper-bounded by some constant, there is a constant D*^ such that 
the left-hand side of the last display is upper-bounded by 



I ^N{s,2L/k)s \og[d/s] ^ I s \og{dls) 



H' n 



As proved in Lemma 11 below, N{s,2L/k) < 0.3{l8ne L/kY^^. Thus, there is a 
constant D2 such that 



(• ^N{s,2L/k)s log{d/s) s \og{d/s) ] ^ D'^k-'I^^s \og[d/s) , , ^ log(^/5) 
I n ^ n ] ~ n ^ n ' 

Combining these results, we see that under the conditions 2D\s\og{d/s)/ n <k 
and 

p'^^s\og[dls) 
n 

consistent estimation of / is possible. Taking D* = 2D\{\ + D\), we complete 
the proof of the first claim of the proposition. To prove the second assertion, 
we apply Thm. 2. Since it holds that 2]-, > 7-, + 1 > - TTcfe^' 

deduce from Thm. 2 that there are some constants D3 and D4 such that if 



2U:— 1__ < 



f /wil^^^AOH^iC^v /Sl0g(rf/S)l ^ 

D-i\ V \ > K 

[ n ^ n ] 
then consistent estimation of $J$ is impossible. Since the $s$-dimensional $L_2$ ball
with radius ./sf contains the Lgo ball of radius N{s,D4/k) > (Dsj^K-"*/^ for 
some constant D5. By rearranging different terms, we get the desired result. 

Lemma 11. For every $\gamma \ge 1$ and $d^* \in \mathbb{N}$, $N_1(d^*, \gamma) \le 0.3\,(9\pi e \gamma)^{d^*/2}$.

Proof. One readily checks that if $\|k\|_2^2 \le d^*\gamma$, then the hypercube centered at
$k$ with side of length 1 is included in the ball centered at the origin and having
radius $\sqrt{d^*\gamma} + 0.5\sqrt{d^*}$. Therefore, $N_1(d^*, \gamma) \le (\sqrt{d^*\gamma} + 0.5\sqrt{d^*})^{d^*}\, \mathrm{Vol}[B_{d^*}(0; 1)]$,
where $\mathrm{Vol}[B_{d^*}(0; 1)]$ stands for the volume of the unit ball in $\mathbb{R}^{d^*}$. Using the
well-known formula for the latter and the Stirling approximation, for every $d^* \ge 1$,
we get $\mathrm{Vol}[B_{d^*}(0; 1)] = \pi^{d^*/2}/\Gamma(1 + d^*/2) \le 0.3\,(2\pi e/d^*)^{d^*/2}$. This implies that $N_1(d^*, \gamma) \le
0.3\,(9\gamma d^*/4)^{d^*/2} (2\pi e/d^*)^{d^*/2} \le 0.3\,(9\pi e \gamma)^{d^*/2}$ and the result follows. □
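The counting quantities $N_1$ and $N$ can be computed exactly for small dimensions, which gives a quick check of the order of magnitude in Lemma 11. The Python sketch below counts lattice points by dynamic programming over the coordinates; it assumes the definitions $N_1(d^*, \gamma) = \mathrm{Card}\{k \in \mathbb{Z}^{d^*} : \|k\|_2^2 \le \gamma d^*\}$ and $N = N_1 - N_2$, where $N_2$ counts the points with $k_1 = 0$. The function names are illustrative. The comparison starts at $d^* = 2$, since for $d^* = 1$ and $\gamma = 1$ the direct count is 3 while the stated bound evaluates to about 2.63.

```python
import math

def count_lattice_points(dim, radius_sq):
    """Exact number of k in Z^dim with ||k||_2^2 <= radius_sq."""
    R = int(math.floor(radius_sq))      # use integer radius_sq to avoid float rounding
    ways = [1] + [0] * R                # ways[r]: points with squared norm exactly r
    for _ in range(dim):
        new = [0] * (R + 1)
        for r, w in enumerate(ways):
            if w == 0:
                continue
            v = 0
            while r + v * v <= R:
                new[r + v * v] += w if v == 0 else 2 * w   # coordinate +v or -v
                v += 1
        ways = new
    return sum(ways)

def N1(d_star, gamma):
    return count_lattice_points(d_star, gamma * d_star)

def N(d_star, gamma):
    """Points with k_1 != 0: all points minus those with k_1 = 0."""
    return N1(d_star, gamma) - count_lattice_points(d_star - 1, gamma * d_star)

for d_star in range(2, 6):
    for gamma in (1, 2, 3):
        bound = 0.3 * (9.0 * math.pi * math.e * gamma) ** (d_star / 2.0)
        print(d_star, gamma, N1(d_star, gamma), N(d_star, gamma), round(bound, 1))
```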

APPENDIX F: SUPPLEMENT TO "TIGHT CONDITIONS FOR CONSISTENCY 
OF VARIABLE SELECTION IN THE CONTEXT OF HIGH 
DIMENSIONALITY" 

This supplementary material provides the proofs of Theorem 3, Proposition 
7, Corollary 3 and Lemma 10 of the article "Tight conditions for consistency of 
variable selection in the context of high dimensionality". 

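F.1. Proof of Theorem 3. To ease notation, we write $\hat J$ instead of $\hat J_n^{(1)}$ throughout
this proof. The empirical Fourier coefficients can be decomposed as follows:

$$\hat\theta_k = \tilde\theta_k + z_k, \quad \text{where } \tilde\theta_k = \frac{1}{n} \sum_{i=1}^{n} \frac{\varphi_k(X_i)}{g(X_i)} f(X_i) \text{ and } z_k = \frac{\sigma}{n} \sum_{i=1}^{n} \frac{\varphi_k(X_i)}{g(X_i)}\, \varepsilon_i. \qquad (28)$$

If, for a multi-index $k$, $\theta_k = 0$, then the corresponding empirical Fourier coefficient
will be close to zero with high probability. To show this, let us first look at
what happens with the $z_k$'s. We have, for every real number $x$,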
2 

P(|zfc|>j(;|A-i,...,A:„)<exp('-;^l ^k^Sm,d* 

with 

i=l o'- oniin 

Therefore, it holds that maxfcgs^ P(|zfc| > x\Xi, . . . < exp{-ng^^^x'^ /Aa'^). 
This entails that by setting Ai = {Qa'^d*\og{24:\fQd I d*)l ng^^^^l'^ and by using 
the inequalities (cf. Lemma 13 below)



d* j 



V, 



Card(S„,dO<0.3| 
we get 

P(^max |zfc|>Ai|Xi,...,X„)< ^ p(|zfc| > 



< 



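Next, we use a concentration inequality for controlling large deviations of the $\tilde\theta_k$'s from the $\theta_k$'s. Recall that in view of the definition $\tilde\theta_k = \frac{1}{n}\sum_{i=1}^n \frac{\varphi_k(X_i)}{g(X_i)}\,f(X_i)$, we have $E(\tilde\theta_k) = \theta_k$. The boundedness of $f$ yields $|\varphi_k(X_i) f(X_i)/g(X_i)| \le \sqrt{2}\,L_\infty/g_{\min}$. Furthermore, the bound $V = \mathrm{Var}\big(\varphi_k(X_i) f(X_i)/g(X_i)\big) \le \int \varphi_k(x)^2 f(x)^2/g(x)\,dx \le 2L_2^2$ combined with Bernstein's inequality implies that

$$P(|\tilde\theta_k - \theta_k| > t) \le 2\exp\Big(-\frac{n t^2}{2V + (2\sqrt{2}/3)\,t L_\infty/g_{\min}}\Big) \le 2\exp\Big(-\frac{n t^2}{4L_2^2 + t L_\infty/g_{\min}}\Big), \qquad \forall t > 0.$$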

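Let us define $\lambda_2 = 4L_2\big(d^*\log(24\sqrt{\gamma}\,d/d^*)/n\big)^{1/2}$. Then,

$$P(|\tilde\theta_k - \theta_k| > \lambda_2) \le 2\exp\Big(-\frac{4L_2^2\, d^*\log(24\sqrt{\gamma}\,d/d^*)}{L_2^2 + \lambda_2 L_\infty/(4 g_{\min})}\Big).$$

The first inequality in the main condition of the theorem implies that the denominator in the exponential is not larger than $2L_2^2$. Hence, $P(\max_{k\in S_{m,d^*}} |\tilde\theta_k - \theta_k| > \lambda_2) \le 0.6/[24\sqrt{\gamma}\,d/d^*]^{d^*}$. Let

$$\mathcal{A}_1 = \Big\{\max_{k\in S_{m,d^*}} |z_k| \le \lambda_1\Big\} \quad\text{and}\quad \mathcal{A}_2 = \Big\{\max_{k\in S_{m,d^*}} |\tilde\theta_k - \theta_k| \le \lambda_2\Big\}.$$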
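The control of the noise terms $z_k$ rests on two generic ingredients: a Gaussian tail bound of the form $P(|Z| > x) \le e^{-x^2/2}$ for a standard normal $Z$, and a union bound over the index set. The following Python sketch checks the tail bound numerically and shows how the union bound combines with it. The function names and the sample values are illustrative assumptions and do not come from the article.

```python
import math


def gaussian_two_sided_tail(x: float) -> float:
    """Exact P(|Z| > x) for a standard normal Z, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))


def sub_gaussian_bound(x: float) -> float:
    """Upper bound exp(-x^2 / 2) on the two-sided tail of a standard normal."""
    return math.exp(-x * x / 2.0)


def union_bound(num_events: int, single_event_bound: float) -> float:
    """Union bound: P(max_k |z_k| > x) <= num_events * max_k P(|z_k| > x), capped at 1."""
    return min(1.0, num_events * single_event_bound)


if __name__ == "__main__":
    # Compare the exact tail with the exponential bound on a few thresholds.
    for x in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(x, gaussian_two_sided_tail(x), sub_gaussian_bound(x))
    # Hypothetical index set of 1000 coefficients, threshold x = 5.
    print(union_bound(1000, sub_gaussian_bound(5.0)))
```

The union bound is only informative when the threshold grows with the logarithm of the cardinality of the index set, which is why the thresholds in the proof involve $d^*\log(\cdot)$.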

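One easily checks that

$$P(\hat J \not\subset J) \le P(\mathcal{A}_1^c) + P(\mathcal{A}_2^c) \le 1.2/[24\sqrt{\gamma}\,d/d^*]^{d^*}.$$

Since $d^* \ge 1$ and $\gamma > 1$, the last inequality implies the required bound on the probability of the event $\hat J \not\subset J$.

As for the converse inclusion, we have

$$P(J \not\subset \hat J) \le P\Big(\exists j \in J \text{ s.t. } \max_{k\in S_{m,d^*}:\,k_j \ne 0} |\hat\theta_k| \le \lambda\Big) \le \mathbb{1}\Big(\exists j \in J \text{ s.t. } \max_{k\in S_{m,d^*}:\,k_j \ne 0} |\theta_k| \le 2\lambda\Big) + P(\mathcal{A}_1^c) + P(\mathcal{A}_2^c).$$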

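We show now that the first term in the last line is equal to zero. If this was not the case, then for some value $j_0$ we would have $Q_{j_0} \ge \kappa$ and $|\theta_k| \le 2\lambda$ for all $k \in S_{m,d^*}$ such that $k_{j_0} \ne 0$. This would imply that

$$Q_{j_0,m,d^*} = \sum_{k\in S_{m,d^*}:\,k_{j_0}\ne 0} \theta_k^2 \le 4\lambda^2 N(d^*, m^2/d^*).$$

On the other hand,

$$\sum_{\|k\|_2 > m} \theta_k^2 \le \frac{1}{m^2}\sum_{\|k\|_2 > m}\ \sum_{j\in J} k_j^2\,\theta_k^2 \le \frac{L d^*}{m^2}.$$

Remark now that the choice of the truncation parameter $m$ proposed in the statement of the theorem implies that $Q_{j_0} - Q_{j_0,m,d^*} \le \kappa/2$. Combining these estimates, we get

$$Q_{j_0} \le \frac{\kappa}{2} + 4\lambda^2 N(d^*, m^2/d^*),$$

which is impossible since $Q_{j_0} \ge \kappa$.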



F.2. Proof of Proposition 7. Let $M = \binom{d}{d^*}$ and let $\{f_0, f_1, \ldots, f_M\}$ be a set included in $\Sigma$. Let $I_1, \ldots, I_M$ be all the subsets of $\{1,\ldots,d\}$ containing exactly $d^*$ elements somehow enumerated. Let us set $f_0 \equiv 0$ and define $f_\ell$, for $\ell \ne 0$, by its Fourier coefficients $\{\theta_k^\ell : k \in \mathbb{Z}^d\}$ as follows:

$$\theta_k^\ell = \begin{cases} 1, & k = (k_1,\ldots,k_d) = (\mathbb{1}_{1\in I_\ell},\ldots,\mathbb{1}_{d\in I_\ell}),\\ 0, & \text{otherwise.} \end{cases}$$

Obviously, all the functions $f_\ell$ belong to $\Sigma$ and, moreover, each $f_\ell$ has $I_\ell$ as sparsity pattern. One easily checks that our choice of $f_\ell$ implies $\mathcal{K}(P_{f_\ell}, P_{f_0}) = n\|f_\ell - f_0\|_2^2 = n$. Therefore, if $\alpha\log M = \alpha\log\binom{d}{d^*} \ge n$, the desired inequality is satisfied. To conclude it suffices to note that $\log\binom{d}{d^*}$ is larger than or equal to $d^*\log(d/d^*) = d^*(\log d - \log d^*)$.
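The last step uses the elementary lower bound $\log\binom{d}{d^*} \ge d^*\log(d/d^*)$. A minimal Python sketch, with hypothetical function names chosen only for illustration, compares the two quantities on a grid of small values of $d$ and $d^*$:

```python
import math


def log_binomial(d: int, d_star: int) -> float:
    """Logarithm of the number of subsets of {1, ..., d} with exactly d_star elements."""
    return math.log(math.comb(d, d_star))


def entropy_lower_bound(d: int, d_star: int) -> float:
    """The lower bound d* (log d - log d*) on log C(d, d*)."""
    return d_star * (math.log(d) - math.log(d_star))


if __name__ == "__main__":
    for d, d_star in [(10, 2), (100, 5), (1000, 10)]:
        print(d, d_star, log_binomial(d, d_star), entropy_lower_bound(d, d_star))
```

The bound is an equality for $d^* = 1$ and for $d^* = d$, and it is loose in between, so the condition $\alpha\, d^*\log(d/d^*) \ge n$ is sufficient but not necessary for $\alpha\log M \ge n$.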
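F.3. Proof of Lemma 10. First we specify the notation. Let $y = \{y_k : k \in S\}$ and, for $\omega \in \{\pm 1\}^S$, $\theta_\omega = \{A\omega_k : k \in S\}$. The likelihood ratio between $P_\omega = P_{f_\omega}$ and $P_0 = P_{f_0}$ is

$$\frac{dP_\omega}{dP_0}(y) = \exp\Big(-\frac{n}{2}\|\theta_\omega\|_2^2 + n\,\theta_\omega\cdot y\Big),$$

where, for a vector $z \in \mathbb{R}^S$, $\|z\|_2^2 = z\cdot z$. Then, the likelihood ratio between the uniform mixture $\bar P = 2^{-|S|}\sum_{\omega} P_\omega$ and $P_0$ is

$$\frac{d\bar P}{dP_0}(y) = 2^{-|S|}\sum_{\omega\in\{\pm 1\}^S} \frac{dP_\omega}{dP_0}(y).$$

As a consequence, simple algebra yields

$$E_{P_0}\Big[\Big(\frac{d\bar P}{dP_0}(y)\Big)^2\Big] = \Big(\frac{\exp(nA^2) + \exp(-nA^2)}{2}\Big)^{|S|} \le \exp(|S|\,n^2A^4).$$

The last inequality follows from the elementary inequality $\cosh(x) \le e^{x^2}$, which can be checked by decomposing the functions $\cosh(x)$ and $e^{x^2}$ in Taylor series and comparing the corresponding terms.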
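The term-by-term Taylor comparison can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it compares the even Taylor coefficients of $\cosh(x)$ with those of $e^{x^2/2}$, which gives the sharper bound $\cosh(x) \le e^{x^2/2}$, and this in turn implies $\cosh(x) \le e^{x^2}$. All function names are hypothetical.

```python
import math


def cosh_terms(x: float, n_terms: int) -> list:
    """First Taylor terms of cosh(x): x^(2k) / (2k)!."""
    return [x ** (2 * k) / math.factorial(2 * k) for k in range(n_terms)]


def exp_half_square_terms(x: float, n_terms: int) -> list:
    """First Taylor terms of exp(x^2 / 2): x^(2k) / (2^k k!)."""
    return [x ** (2 * k) / (2 ** k * math.factorial(k)) for k in range(n_terms)]


def bounds_hold(x: float) -> bool:
    """Check cosh(x) <= exp(x^2 / 2) <= exp(x^2), with a small relative tolerance."""
    middle = math.exp(x * x / 2.0)
    return math.cosh(x) <= middle * (1 + 1e-12) and middle <= math.exp(x * x)


if __name__ == "__main__":
    # Each term of cosh is dominated by the matching term because (2k)! >= 2^k k!.
    print(list(zip(cosh_terms(1.5, 5), exp_half_square_terms(1.5, 5))))
    print(all(bounds_hold(i / 10) for i in range(-50, 51)))
```

Applied with $x = nA^2$ and raised to the power $|S|$, either bound controls $\cosh(nA^2)^{|S|}$ by an exponential of $|S|\,n^2A^4$ up to a constant in the exponent.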

F.4. Proof of Corollary 3. Let us set $\gamma = L/\kappa$ and $\gamma_\tau = (1+\tau)\gamma$. Applying Theorem 1 with $A = 2$, we get that

$$P(\hat J_n \ne J) \le 3\,(2ed/d^*)^{-d^*}$$

provided that the condition

$$\frac{8\sqrt{2N(d^*,\gamma_\tau)\,d^*\log(2ed/d^*)}}{n} + \frac{16\,d^*\log(2ed/d^*)}{n} \le \frac{\kappa\tau}{1+\tau} \qquad (29)$$

is satisfied for some $\tau > 0$. Clearly, when $d \to \infty$, for every $d^* \ge 1$, it holds that

$$(2ed/d^*)^{-d^*} \le d^{-1} \to 0.$$

Therefore, it is sufficient to check that the assumptions

$$\lim_{n\to\infty}\frac{\log\log d}{\log n} = 0, \qquad \limsup_{n\to\infty}\frac{d^*}{\log n} < \frac{2}{l_\gamma(z_\gamma)} \qquad (30)$$

imply that (29) is true for sufficiently large values of $n$. We will show that the left-hand side of (29) tends to zero as $n \to \infty$.

First remark that (30) yields 

logd<n^^\ d*<n^l^ 
for sufficiently large $n$. Therefore,

$$\lim_{n\to\infty}\frac{16\,d^*\log(2ed/d^*)}{n} = 0.$$

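Second, by continuity of the mappings $\gamma \mapsto l_\gamma$ and $\gamma \mapsto z_\gamma$, there exists $\tau > 0$ such that

$$\limsup_{n\to\infty}\frac{d^*}{\log n} < \frac{2}{l_{\gamma_\tau}(z_{\gamma_\tau})} \iff \limsup_{n\to\infty}\Big(\frac{d^*\,l_{\gamma_\tau}(z_{\gamma_\tau})}{2\log n} - 1\Big) < 0. \qquad (31)$$

This inequality, combined with the relation $\log N(d^*,\gamma_\tau) = d^*\,l_{\gamma_\tau}(z_{\gamma_\tau})(1 + o(1))$ (cf. Eq. (12) in the manuscript [12]), implies that

$$\log\Big(\frac{1}{n}\sqrt{N(d^*,\gamma_\tau)\,d^*\log(2ed/d^*)}\Big) \le \log n\cdot\Big(\frac{\log N(d^*,\gamma_\tau)}{2\log n} - 1 + \frac{\log d^*}{2\log n} + \frac{\log\log(2ed)}{2\log n}\Big)$$

$$\le \log n\cdot\Big(\frac{d^*\,l_{\gamma_\tau}(z_{\gamma_\tau})(1+o(1))}{2\log n} - 1 + \frac{\log d^*}{d^*}\cdot\frac{d^*}{2\log n} + \frac{\log\log d}{\log n}\Big),$$

where the expression in parentheses tends to a negative number, so that the right-hand side tends to $-\infty$.

This entails that the first term in the left-hand side of (29) tends to zero, which completes the proof.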

F.5. Some technical lemmas. 

Lemma 12. For every $\gamma > 1$ the numbers $N_1(d^*,\gamma) = \mathrm{Card}\{k \in \mathbb{Z}^{d^*} : \|k\|_2^2 \le d^*\gamma\}$ admit the following upper bound:

$$N_1(d^*,\gamma) \le 0.3\,(9\pi e\gamma)^{d^*/2}.$$

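Proof. One readily checks that if $\|k\|_2^2 \le d^*\gamma$, then the hypercube centered at $k$ with side of length 1 is included in the ball centered at the origin and having radius $\sqrt{d^*\gamma} + 0.5\sqrt{d^*}$. Therefore,

$$N_1(d^*,\gamma) \le \big(\sqrt{d^*\gamma} + 0.5\sqrt{d^*}\big)^{d^*}\,\mathrm{Vol}[B_{d^*}(0;1)],$$

where $\mathrm{Vol}[B_{d^*}(0;1)]$ stands for the volume of the unit ball in $\mathbb{R}^{d^*}$. Using the well-known formula for the latter and the Stirling approximation, for every $d^* \ge 1$, we get:

$$\mathrm{Vol}[B_{d^*}(0;1)] = \frac{2\pi^{d^*/2}}{d^*\,\Gamma(d^*/2)} \le 0.4\,\Big(\frac{2\pi e}{d^*}\Big)^{d^*/2}.$$

Since $\gamma > 1$, it follows that

$$N_1(d^*,\gamma) \le \Big(\frac{3\sqrt{d^*\gamma}}{2}\Big)^{d^*}\,\mathrm{Vol}[B_{d^*}(0;1)] \le 0.4\,\Big(\frac{9\pi e\gamma}{2}\Big)^{d^*/2} \le 0.3\,(9\pi e\gamma)^{d^*/2}.$$

This proves the result. □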
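The geometric step of the proof, which bounds the number of lattice points by the volume of a slightly enlarged ball, can be checked by brute force in small dimensions. The Python sketch below is only an illustration of that volume argument; the function names and the tested values of $d^*$ and $\gamma$ are assumptions made here, and the code does not verify the closed-form constants of the lemma.

```python
import itertools
import math


def count_lattice_points(d_star: int, gamma: float) -> int:
    """Brute-force count of k in Z^{d*} with ||k||_2^2 <= d* * gamma."""
    radius_sq = d_star * gamma
    r = math.isqrt(math.floor(radius_sq))  # largest possible |k_j|
    coords = range(-r, r + 1)
    return sum(
        1
        for k in itertools.product(coords, repeat=d_star)
        if sum(c * c for c in k) <= radius_sq
    )


def unit_ball_volume(d_star: int) -> float:
    """Volume of the Euclidean unit ball in R^{d*}: pi^{d*/2} / Gamma(d*/2 + 1)."""
    return math.pi ** (d_star / 2) / math.gamma(d_star / 2 + 1)


def volume_bound(d_star: int, gamma: float) -> float:
    """Volume of the ball of radius sqrt(d* gamma) + 0.5 sqrt(d*) containing all unit cubes."""
    radius = math.sqrt(d_star * gamma) + 0.5 * math.sqrt(d_star)
    return radius ** d_star * unit_ball_volume(d_star)


if __name__ == "__main__":
    for d_star in (1, 2, 3):
        for gamma in (1.0, 2.0, 4.0):
            print(d_star, gamma, count_lattice_points(d_star, gamma), volume_bound(d_star, gamma))
```

The unit cubes centered at distinct lattice points have disjoint interiors, so their total volume, which equals the number of points, cannot exceed the volume of any set containing all of them.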

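Lemma 13. Let $S_{m,d^*} = \{k \in \mathbb{Z}^d : \|k\|_2 \le m \ \&\ \|k\|_0 \le d^*\}$. If $m = \sqrt{\gamma d^*}$ with $\gamma > 1$, then $\mathrm{Card}(S_{m,d^*}) \le 0.3\,(24\sqrt{\gamma}\,d/d^*)^{d^*}$.

Proof. It is clear that

$$\mathrm{Card}(S_{m,d^*}) \le \binom{d}{d^*}\,N_1(d^*,\gamma).$$

Combining this with the inequality

$$\binom{d}{d^*} \le \Big(\frac{ed}{d^*}\Big)^{d^*},$$

and the previous lemma, we get

$$\mathrm{Card}(S_{m,d^*}) \le 0.3\,\Big(\frac{ed}{d^*}\Big)^{d^*}(9\pi e\gamma)^{d^*/2} = 0.3\,\Big(\frac{e\sqrt{9\pi e}\,\sqrt{\gamma}\,d}{d^*}\Big)^{d^*}.$$

The claim of the lemma follows now from the inequality $e\sqrt{9\pi e} < 24$. □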
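The constant 24 comes from the product of the binomial bound $(ed/d^*)^{d^*}$ and the lattice-point bound $0.3\,(9\pi e\gamma)^{d^*/2}$. As a quick numerical sanity check, the sketch below evaluates $e\sqrt{9\pi e}$ and compares the product of the two bounds with the final expression; the function names and test values are hypothetical and serve only as an illustration.

```python
import math

# The constant e * sqrt(9 * pi * e), which the proof rounds up to 24.
LEMMA13_CONSTANT = math.e * math.sqrt(9 * math.pi * math.e)


def product_of_bounds(d: int, d_star: int, gamma: float) -> float:
    """(e d / d*)^{d*} times 0.3 (9 pi e gamma)^{d*/2}."""
    binomial_part = (math.e * d / d_star) ** d_star
    lattice_part = 0.3 * (9 * math.pi * math.e * gamma) ** (d_star / 2)
    return binomial_part * lattice_part


def final_bound(d: int, d_star: int, gamma: float) -> float:
    """0.3 (24 sqrt(gamma) d / d*)^{d*}, the bound stated in the lemma."""
    return 0.3 * (24 * math.sqrt(gamma) * d / d_star) ** d_star


if __name__ == "__main__":
    print(LEMMA13_CONSTANT)
    print(product_of_bounds(50, 5, 2.0), final_bound(50, 5, 2.0))
```

Rounding the constant up to 24 costs a factor $(24/(e\sqrt{9\pi e}))^{d^*}$, which stays close to 1 for moderate $d^*$.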
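Lemma 14. For every pair of positive integers $(d, d^*)$ such that $d^* \le d$:

$$\binom{d}{d^*} \le \Big(\frac{ed}{d^*}\Big)^{d^*}.$$

Proof. We will proceed by induction over $d^*$. If $d^* = 1$, we have

$$\binom{d}{1} = d \le ed.$$

Assume that the inequality

$$\binom{d}{d^*} \le (ed/d^*)^{d^*}$$

is true for some $1 \le d^* < d$. Let us show that this entails the inequality

$$\binom{d}{d^*+1} \le \Big(\frac{ed}{d^*+1}\Big)^{d^*+1}.$$

It holds that

$$\binom{d}{d^*+1} = \binom{d}{d^*}\,\frac{d-d^*}{d^*+1} \le \Big(\frac{ed}{d^*}\Big)^{d^*}\frac{d}{d^*+1} = \Big(\frac{ed}{d^*+1}\Big)^{d^*+1}\frac{(1+1/d^*)^{d^*}}{e} \le \Big(\frac{ed}{d^*+1}\Big)^{d^*+1},$$

and the result follows.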



□ 
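The binomial bound $\binom{d}{d^*} \le (ed/d^*)^{d^*}$ used in the proof of Lemma 13 can be checked numerically for small values. The sketch below does so, and also checks the elementary fact $(1 + 1/d^*)^{d^*} < e$, which is what a natural induction step over $d^*$ relies on. The function names and the range of tested values are assumptions made for this illustration.

```python
import math


def binomial_upper_bound(d: int, d_star: int) -> float:
    """The bound (e d / d*)^{d*} on the binomial coefficient C(d, d*)."""
    return (math.e * d / d_star) ** d_star


def bound_holds_up_to(max_d: int) -> bool:
    """Check C(d, d*) <= (e d / d*)^{d*} for all 1 <= d* <= d <= max_d."""
    return all(
        math.comb(d, d_star) <= binomial_upper_bound(d, d_star)
        for d in range(1, max_d + 1)
        for d_star in range(1, d + 1)
    )


def induction_factor(d_star: int) -> float:
    """The factor (1 + 1/d*)^{d*}, which increases to e without reaching it."""
    return (1 + 1 / d_star) ** d_star


if __name__ == "__main__":
    print(bound_holds_up_to(60))
    print([round(induction_factor(k), 4) for k in (1, 2, 10, 100)])
```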